Abstract

Logistic and Cox regression methods are practical 
tools used to model the relationships between certain 
student learning outcomes and their relevant explanatory 
variables. The logistic regression model fits an S-shaped
curve to a binary outcome with data points of zero and
one. The Cox regression model allows investigators to
study the duration and timeline of the critical events,
which are also binary (dichotomous) measures. This
paper introduces logistic and Cox regression models by 
illustrating examples, implementing step-by-step SPSS 
procedures, and further comparing the similarities and 
differences of the model characteristics. Logistic regression 
analysis was conducted to investigate the effects of the 
explanatory variables such as pre-admission variables, 
college cumulative GPAs, and curriculum tracks on student 
licensure examination. Moreover, logistic regression 
analysis was employed to quantify the effect (odds or 
odds ratio) of specific explanatory variables on the binary 
outcome holding other variables constant. With regard to
Cox regression analysis, the outcome variable of interest
was the timing of experiencing academic difficulty:
dismissal, withdrawal, and leave of absence. The Cox
regression model was used to detect when students were 
most likely to experience academic difficulty beyond their 
matriculation. The model also allowed the investigators to 
measure the effect (relative hazard or hazard ratio) of 
specific risk factors on the academic difficulty after adjusting 
for other factors. By identifying the occurrence of critical
events along with the explanatory variables, college 
administrators and faculty could implement intervention 
strategies to ensure student success. 

Introduction 

Regression analyses are statistical procedures that 
describe the relationship between an outcome (dependent, 
or response) variable and one or more explanatory 
(independent, or predictor) variables or risk factors. The
choice of regression models depends largely on the
research objectives and the measurement scales of the 
outcome variable in the study. Logistic regression analysis 
is a suitable technique to investigate the effects of the 
explanatory variables on the binary outcome, either success 
or failure. Moreover, it allows investigators to perform 
predictions based on the resulting model. The Cox 
regression model becomes the appropriate choice for 
studying the risk factors in relation to the duration and 
timeline until occurrence of the critical event, which is also 
a binary measure. However, the model itself is not
designed for the purpose of future predictions.

Logistic regression analysis has been widely used to 
study student performance, enrollment, graduation, and 
post-graduate employment status, which are restricted to 
a binary outcome such as ‘success or failure’, ‘enrolled or
not enrolled’, ‘graduated or not graduated’, and ‘primary
care or non-primary care specialty’ (Case et al., 1994;
Sadler et al., 1997; Strayhorn, 2000; and Hojat, 1995). The
logistic regression method allows investigators to estimate 
and interpret the effects of the explanatory variables on the 
binary outcome. The maximum likelihood estimation 
technique is readily available to estimate the regression 
coefficients for the non-linear model equation. This 
technique maximizes the probability of getting the observed 
data given the fitted regression coefficients (Hosmer and
Lemeshow, 1989; Eliason, 1993; and Pampel, 2000). The
model equation lends itself to future predictions
to assess the predictive power. In addition, investigators 
can transform the model equation into the odds and odds 
ratio of the event occurrence. The odds or odds ratio is a 
measure of the direction and strength of the relationship 
between the explanatory variables and the effects of such 
variables on the binary outcome. The logistic regression 
model quantifies the association between the explanatory 
variables and the outcome variable of interest after adjusting 
for other explanatory variables (Matthews and Farewell, 
1996). Therefore, it is a useful tool to analyze the 
dependence of a binary outcome on a set of explanatory
variables. 

Cox regression analysis is a branch of survival analyses, 
which is used to analyze the timeline until a critical event 
occurs (Cox, 1972; Cox and Oakes, 1984; and Miller,
1981). Survival analyses discussed in this paper appear
in diverse fields under a variety of names. In biomedical 
and health sciences, these are called survival analyses 
because the event of interest is mortality or death. In 
sociological research, these are often referred to as event 
history studies. In engineering studies, the term reliability 
theory is commonly applied to the lifetime of an object. 
Cox regression analysis allows investigators to answer 
two questions simultaneously: "Has the critical event of
interest occurred?" and "Which risk factors contribute to
the event occurrence?" (Singer and Willett, 1991). In Cox
regression analysis, the hazard and survival functions are 
expressed as a function of time and the risk factors, 
which enable investigators to address research questions 
in education such as: "How many semesters elapse
before students drop out of school?", "At what year are
students likely to graduate from the college?", and "What
explanatory variables contribute to students' dropout or
graduation?" Numerous studies related to the timing of
student departure from college were conducted during the 
past decade (DesJardins et al., 1997; Han and Ganges, 
1995; Huff and Fang, 1999; Ronco, 1994; Singer and 
Willett, 1993; and Willett and Singer, 1991). Moreover, in
the Cox regression model, investigators can determine 
the effectiveness of college programs by comparing the 
patterns of the survival functions. For example, if the 
survival function for Group A consistently lies above that 
of Group B, the intervention program would appear more 
effective for Group A during the study period. 

In the Cox regression model, the hazard function is a 
function of a set of risk factors and baseline hazard. The 
risk factors are similar to independent variables of the 
linear regression model except they appear to be non- 
linear in the exponential expression. The baseline hazard 
is similar to the constant or intercept of the linear regression 
model. It describes the overall level of risk and reveals the 
main effect of the time variable (Singer and Willett, 1991). 
The hazard function displays the timeline fluctuation of a 
student experiencing the critical event. The student with 
the greater risk has a higher value of the hazard function 
as compared to one with the lesser risk at that same 
particular time. Thus, a high hazard function indicates the 
critical event is more likely to occur. Conversely, a low 
hazard function shows that the critical event is less likely 
to occur (Kleinbaum, 1996). For instance, investigators 
can detect the ineffectiveness of the support services 
program by comparing the hazard functions. If the hazard 
function is higher for Program A than Program B, Program 
A appears to be ineffective during the study period. 

Similar to the logistic regression analysis, the Cox
regression model allows investigators to estimate and
interpret the effects of the risk factors on the event occurring. 
The partial maximum likelihood estimation technique is 
used to estimate the regression coefficients of the model 
equation (Hosmer and Lemeshow, 1989; and Kleinbaum,
1996). The interpretation of the regression coefficients in 
the Cox regression model is virtually the same as in 
logistic regression analysis. For the positive regression 
coefficient, the hazard of a student experiencing the critical 
event increases as the value of the risk factor increases. 
Moreover, if the regression coefficient is zero, the value of 
the relative hazard becomes one, indicating that the hazard 
of a student experiencing the critical event is not affected 
by the risk factor. However, for the negative regression 
coefficient, the hazard of a student experiencing the critical 
event decreases when the value of the risk factor increases. 
For example, if student dropout is the critical event of 
interest, the Cox regression model can be used to study 
the timeline and the risk factors associated with student 
dropout. 

The model assumption of Cox regression is very different 
from that of the logistic regression model. Logistic 
regression assumes that residuals are normally distributed 
with a mean of zero and a constant variance. However, the 
Cox regression model assumes that the hazard ratio for 
different students with different values of the risk factor is 
independent of the time variable (Belle et al., 2004; and
Kleinbaum, 1996). Thus, when the two hazard functions
are proportional, the title ‘proportional hazards model’ is 
applied to Cox regression. The proportionality of the model 
assumption implies that the hazard ratio is constant over 
time. Therefore, the verification of the model assumption 
is essential to ensure the model appropriateness. 
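
The proportionality assumption can be made concrete with a small numerical sketch. It assumes the usual Cox form, in which the hazard equals a baseline hazard multiplied by an exponential term in the risk factor. The baseline curve, the coefficient value, and all function names below are hypothetical and chosen only for illustration.

```python
import math


def baseline_hazard(t):
    # Hypothetical baseline: risk rises in the early semesters, then declines.
    return 0.02 + 0.03 * t * math.exp(-0.5 * t)


def hazard(t, x, beta):
    # Assumed Cox form with one risk factor: h(t | x) = h0(t) * exp(beta * x).
    return baseline_hazard(t) * math.exp(beta * x)


beta = 0.7  # hypothetical regression coefficient for a 0/1 risk factor

# Hazard ratio (x = 1 versus x = 0) evaluated at semesters 1 through 8.
ratios = [hazard(t, 1, beta) / hazard(t, 0, beta) for t in range(1, 9)]

for t, r in zip(range(1, 9), ratios):
    print(f"semester {t}: hazard ratio = {r:.4f}")
```

Whatever shape the baseline takes, it cancels in the ratio, so under this form the hazard ratio equals e raised to the coefficient at every time point. A ratio that drifts over time in real data would suggest that the assumption does not hold.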

In an attempt to demonstrate the usefulness of the two 
methods, this paper provides a brief overview and example 
illustrations. In particular, it offers step-by-step SPSS PC 
commands used for analyzing a binary result or event 
occurring related to student-learning outcome. In this 
study, the research objectives and questions are defined 
as the foundation of the investigation. Next, the study 
variables are gathered to form a research-oriented database, 
which are extracted from the existing student tracking 
system. In addition, the principle of model strategies 
serves as a guideline to build the models. It comprises all 
relevant variables at the initial phase of the model fitting 
and achieves parsimony and consistency upon model 
completion. The paper focuses mainly on the model 
equations, the assessment of model fittings, the 
interpretation of odds ratio and hazard ratio, and the 
importance of the model assumptions. Moreover, the two 
methods are placed side by side to analyze the similarities 
and differences of their characteristics. 
Literature Review

Using logistic regression analysis (Sadler et al., 1997),
the second semester enrollment was discovered to be 
significantly associated with the explanatory variables 
such as tuition benefits, state residency, race, high school 
GPA, and summer orientation. The results of this study 
could help the institution implement appropriate 
interventions that target students at-risk for leaving. To 
investigate student retention at Historically Black Colleges
and Universities, a logistic regression model was 
constructed (McDaniel and Graham, 1999). The research 
findings showed that students were more likely to return 
for the second year if they earned higher ACT test scores. 
In addition, students aspiring to achieve doctoral and 
professional degrees were more likely to persist than 
students with lower degree aspirations. These study results 
could assist college administrators and faculty to select 
students who are more likely to succeed at their institution. 

By using logistic regression analysis, one study found 
that student performance in the college program predicted 
whether or not students would apply to medical school, 
get accepted by medical school, and graduate from medical 
school (Strayhorn, 2000). In constructing this logistic 
regression model, the significant explanatory variables for 
the probability of passing the United States Medical 
Licensure Examination (USMLE) Step 1 were medical 
college admission test scores, medical school freshman 
GPAs, sophomore course performances, and financial aid 
support (Chen et al., 2001). These study results could be
used to document the effectiveness of academic programs 
and applicant screening. Logistic regression analysis (Case 
et al., 1994) was also conducted to investigate the 
relationship between the initial performance of identifying 
examinees and the ultimate pass rates of the USMLE 
Steps 1 and 2. The study results showed that the probability 
of ultimately passing both Steps 1 and 2 was significantly 
related to the initial score achieved. Based on the availability 
of student performance measures, professional activities, 
satisfaction results, and research productivities, a logistic 
regression model was able to predict primary care and 
non-primary care status from the significant predictors, 
which include specialty interest, professional plan, and 
interests expressed in medical school (Hojat, 1995). The 
study results could be used to document the different 
tracks of physician training and education. 

By constructing hazard models for different college 
exiting modes (graduation, withdrawal, transferal, dismissal, 
and leave of absence), investigators could better understand 
the different factors that influence student behavior (Singer 
and Willett, 1991). A survival analysis study indicated that
the academic resource index significantly influenced 
graduation (DesJardins and Moye, 2000). Using survival 
analysis, another study (Han and Ganges, 1995) revealed 
the occurrence of crisis for a selected group of students 
who persisted for four or more years and still left their
university prior to graduation. The results from this study
suggest that students in the risk group should participate in
an intense and sustained intervention program.

The result of survival analysis indicated that some 
students experiencing academic difficulty remained at 
risk throughout the first three years of medical school 
(Fang, 2000). The research finding implied that academic 
support programs focusing only on the entering year might 
not be sufficient to fully address this extended period of 
risk. The study results of survival analysis (Huff and Fang,
1999) demonstrated that increased relative risks
of students experiencing academic difficulty were
associated with low MCAT scores, low science GPA, low 
undergraduate institutional selectivity, being a woman, 
being a member of a racial-ethnic underrepresented 
minority, or being older. Clearly, investigators in higher 
education can use survival analysis to identify students 
who are most likely to experience the occurrence of 
critical events — dropout, withdrawal, dismissal, and delay 
of graduation. Knowing the time-to-event occurrence and 
related risk factors, college administrators and faculty can 
effectively implement the intervention strategies to increase 
the likelihood of student success. 

Logistic Regression Equation 

The logistic regression model is primarily written as
$Y = P(X) + E$, where Y is the binary outcome (event occurring
coded as 1 or event not occurring coded as 0) (Hosmer
and Lemeshow, 1989). The probability P(X) of obtaining
the binary outcome is considered to be the estimated
value given that the explanatory variables (X) are known
observations. The error term (E), also called the residual,
represents the difference between the actual binary
outcome (Y) and the estimated probability P(X). The model
is commonly written as $P(X) = e^{Z}/(1 + e^{Z})$ or the equivalent
form $P(X) = 1/(1 + e^{-Z})$, where Z stands for the linear
combination $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$
(Hosmer and Lemeshow, 1989). The "e" term in the equation is the
base of the natural logarithm, which is approximately
2.718. The regression coefficients ($\beta$) are unknown
parameters to be estimated. Moreover, the model assumes
that residuals have a mean of zero and a constant variance
of $P(X)[1 - P(X)]$, and that they are statistically independent
of one another.

In a logistic regression model, the probability of a 
student obtaining a binary outcome is always in the range 
of zero to one, regardless of the value of Z. When Z is
negative infinity in the model equation $P(X) = 1/(1 + e^{-Z})$,
the probability of a student obtaining a binary outcome is
virtually zero ($e^{-Z}$ becomes positive infinity because Z is
negative infinity, and one divided by one plus
positive infinity equals zero). Also, when Z increases from
negative infinity to zero in the model equation
$P(X) = e^{Z}/(1 + e^{Z})$, the probability of a student obtaining a binary
outcome increases from zero to one half ($e^{Z}$ equals one
when Z is zero, and the model equation shows that one
divided by two equals one half). When Z increases from
zero to positive infinity in the model equation
$P(X) = e^{Z}/(1 + e^{Z})$, the probability of a student obtaining a binary
outcome increases from one half to one (as $e^{Z}$ grows without
bound, $e^{Z}$ divided by one plus $e^{Z}$ approaches one). The
logistic regression equation is the steepest when the 
probability equals 0.5 and flattens out on both top and 
bottom tails as the probability value approaches zero and 
one (Dey and Astin, 1993). All of the characteristics 
mentioned above allow investigators to fit the S-shaped 
curve into a data set of binary outcomes containing values 
of zero and one. 
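
The limiting behaviour described above can be checked numerically. The following is a minimal sketch of the logistic function only, not of any fitted model; the function names `logistic` and `slope` are illustrative, and the slope formula P(1 - P) is the standard derivative of the logistic curve with respect to Z.

```python
import math


def logistic(z):
    # P = 1 / (1 + e^(-Z)), arranged so that large |Z| does not overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def slope(z):
    # Derivative of the logistic curve with respect to Z: P * (1 - P).
    p = logistic(z)
    return p * (1.0 - p)


for z in (-10, -2, 0, 2, 10):
    print(f"Z = {z:>3}: P = {logistic(z):.4f}, slope = {slope(z):.4f}")
```

The probability stays between zero and one for any Z, equals one half at Z = 0, and the slope is largest there (0.25) and shrinks toward zero in both tails, which is the S-shape referred to in the text.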

The logistic regression equation is a non-linear curve 
rather than a straight-line. Thus, the parameter estimation 
method for the logistic regression model is called the 
maximum likelihood estimation, which is different from 
the least-square estimation method for the linear regression 
model. Therefore, it is necessary to use the iterative 
process via the computer software to find better 
approximations of the logistic parameters that satisfy the 
log likelihood equation. When the log likelihood equation 
is satisfied or maximized, the probability of an individual 
obtaining the observed data is maximized (Eliason, 1993;
Pampel, 2000; and Hosmer and Lemeshow, 1989). In
other words, the solution of the log likelihood equation 
implies that the effect of the explanatory variables on the 
probability of event occurrence is also maximized. 
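
The iterative process can be illustrated with a small sketch. It assumes a single explanatory variable and uses Newton-Raphson updates, which is one common way to maximize the log likelihood; the paper does not state which algorithm SPSS applies internally, and the data and function names here are hypothetical.

```python
import math


def log_likelihood(b0, b1, xs, ys):
    # Sum over students of y*log(p) + (1 - y)*log(1 - p).
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, iterations=25, tol=1e-10):
    b0, b1 = 0.0, 0.0  # starting values
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p          # score for the intercept
            g1 += (y - p) * x    # score for the slope
            h00 += w             # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: inverse information matrix times the score vector.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical data: x is a test score band, y is pass (1) or fail (0).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0_hat, b1_hat = fit_logistic(xs, ys)
print(b0_hat, b1_hat, log_likelihood(b0_hat, b1_hat, xs, ys))
```

Each pass improves the coefficient estimates until the change becomes negligible, at which point the log likelihood is at its maximum for these data.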

When all regression coefficients are estimated, the 
values of the explanatory variables can be plugged into 
the logistic equation to perform predictions. For example, 
if the probability of a student obtaining a binary outcome 
is greater than or equal to a cut-off point (default value
of one half), that student is placed into the success group.
On the other hand, if the probability of a student obtaining 
a binary outcome is less than one half, then that student 
is categorized into the failure group (SPSS, Inc, 2002). 
Therefore, by comparing the predictive results and actual 
observations, investigators can calculate the prediction 
accuracy for the success and failure groups, as well as 
for the combined success and failure group. The cut-off 
point for predicting the binary outcome can be adjusted 
either upward or downward in order to increase specificity 
(accuracy of the prediction results for the failure group) or 
sensitivity (accuracy of the prediction results for the 
successful group) of the model equation. 
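
The effect of moving the cut-off point can be sketched as follows. The predicted probabilities and outcomes are hypothetical, and the function name `classification_accuracy` is illustrative; sensitivity and specificity follow the definitions given in the text (accuracy within the success group and within the failure group, respectively).

```python
def classification_accuracy(probs, actual, cutoff=0.5):
    tp = fp = tn = fn = 0
    for p, y in zip(probs, actual):
        predicted = 1 if p >= cutoff else 0  # success group if p >= cut-off
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn)   # accuracy within the success group
    specificity = tn / (tn + fp)   # accuracy within the failure group
    overall = (tp + tn) / len(actual)
    return sensitivity, specificity, overall


# Hypothetical predicted probabilities and observed outcomes (1 = success).
probs = [0.95, 0.80, 0.65, 0.55, 0.45, 0.40, 0.30, 0.10]
actual = [1, 1, 1, 0, 1, 0, 0, 0]

for cutoff in (0.42, 0.50, 0.60):
    print(cutoff, classification_accuracy(probs, actual, cutoff))
```

In this toy data set, lowering the cut-off raises sensitivity and raising it increases specificity, which is the trade-off the paragraph describes.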

To assess the overall model fitting and the significance
of specific explanatory variables, the logistic regression 
model allows investigators to perform the likelihood ratio 
and Wald tests. The likelihood ratio test compares the 
likelihood for the intercept only model to the likelihood for 
the model with the explanatory variables (Eliason, 1993;
Pampel, 2000; and Hosmer and Lemeshow, 1989). The
logic of hypothesis testing for the likelihood ratio test in 
logistic regression is similar to the F test in the linear
regression model. Investigators may conclude that at
least one of the explanatory variables contributes to the
probability of the student obtaining a binary outcome if the
p value is less than the predetermined significance level
($\alpha$ = .01, .05, or .001). In logistic regression analysis, the
Wald statistic is commonly used to test the significance
of the individual logistic regression coefficients (Hosmer
and Lemeshow, 1989). Moreover, the logic of hypothesis
testing for the Wald test in logistic regression is similar to 
the t test in the linear regression model. In the case that 
a specific regression coefficient is significantly different 
from zero, the corresponding explanatory variable 
significantly contributes to the probability of a student 
obtaining the binary outcome. 
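
As a rough illustration of the two tests, the sketch below computes a likelihood ratio statistic from two log likelihoods and a Wald statistic from a coefficient and its standard error. All numeric inputs are hypothetical, the Wald statistic is taken in its squared form (coefficient divided by standard error, squared), and the p value helper covers only the one-degree-of-freedom chi-square case.

```python
import math


def likelihood_ratio_statistic(ll_null, ll_model):
    # -2 times the difference between the intercept-only and full log likelihoods.
    return -2.0 * (ll_null - ll_model)


def wald_statistic(coef, std_err):
    # Squared ratio of the coefficient to its standard error.
    return (coef / std_err) ** 2


def chi2_p_value_df1(stat):
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical values for illustration only.
lr = likelihood_ratio_statistic(ll_null=-120.0, ll_model=-100.0)
wald = wald_statistic(coef=0.9, std_err=0.3)
wald_p = chi2_p_value_df1(wald)
print(lr, wald, wald_p)
```

A likelihood ratio statistic is referred to a chi-square distribution whose degrees of freedom equal the number of explanatory variables added, so the one-degree-of-freedom helper applies to it only when a single variable is being tested.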

Interpretations of Odds, Log Odds, Odds 
Ratio, and Delta P 

By means of the mathematical transformation, 
investigators can transform an estimated probability into 
odds, log odds, odds ratio, and delta p, respectively. An 
example of a simple logistic regression model containing 
only one explanatory variable is used to illustrate this 
transformation process and interpret the resulting statistics. 
The odds can be derived as a ratio of the two probabilities,
which is written as odds = P(event occurring)/P(event not
occurring) = $P(X)/[1 - P(X)] = e^{\beta_0 + \beta X}$, where
$P(X) = e^{\beta_0 + \beta X}/[1 + e^{\beta_0 + \beta X}]$ (Hosmer and Lemeshow, 1989). The term
$e^{\beta_0}$ refers to the value of the odds for the constant and $e^{\beta X}$
represents the value of the odds related to the explanatory
variable. The odds ($e^{Z}$, with $Z = \beta_0 + \beta X$) can be interpreted as a stand-alone
statistic without the involvement of the change (increase 
or decrease) of the explanatory variable. For instance, 
when the value of the odds equals three (0.75/0.25), it 
means that the odds of student obtaining a binary outcome 
(event occurring) is three times higher than that of the 
same student not obtaining a binary outcome. However, if 
the value of the odds equals one (0.5/0.5 = 1), the odds of
obtaining a binary outcome is equivalent to the chance of 
obtaining a head when flipping a fair coin. 

When the natural logarithm is applied to
both sides of the odds equation mentioned above, the log
odds, known as the logit, can be written as
$\log_e(\text{odds}) = \log_e\{P(X)/[1 - P(X)]\} = \beta_0 + \beta X$ (Hosmer and Lemeshow,
1989). Given that a linear relationship exists between the
dependent variable (log odds) and the independent variable
(X), the interpretation of the effect of a specific explanatory
variable is quite similar to linear regression analysis. The
log odds can be interpreted as follows: a one-unit change in the
explanatory variable leads to a change of $\beta$
units in the log odds of a student obtaining a binary
outcome. The log odds provides only the direction of the 
relationship, but is limited in providing meaningful 
information about the effect of the explanatory variable. As 
a result, it is difficult to understand the meaning of log 
odds. Therefore, the odds ratio should be used to interpret 
the effect of the explanatory variable. 
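
The conversions among probability, odds, and log odds can be sketched in a few lines. The function names are illustrative, and the probabilities are the ones used in the examples above.

```python
import math


def odds(p):
    # Odds = P(event occurring) / P(event not occurring).
    return p / (1.0 - p)


def log_odds(p):
    # Logit: natural logarithm of the odds.
    return math.log(odds(p))


def probability_from_log_odds(logit):
    # Inverse transformation: P = 1 / (1 + e^(-logit)).
    return 1.0 / (1.0 + math.exp(-logit))


for p in (0.25, 0.50, 0.75):
    print(f"P = {p:.2f}: odds = {odds(p):.3f}, log odds = {log_odds(p):.3f}")
```

A probability of 0.75 gives odds of three, and a probability of 0.5 gives odds of one and log odds of zero. The log odds are symmetric around zero, which is convenient for modelling but harder to read than the odds themselves.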
The definition of odds ratio is different from that of the
odds. Recall from the previous paragraph that the odds is
a ratio of two probabilities, where one refers to the
event occurring and the other refers to the event not
occurring. The odds ratio is the ratio of two odds when
different values of the explanatory variable apply to them.
The odds ratio is also known as the odds change, which describes
the proportionate change in the odds for a one-unit difference
in the explanatory variable (Hosmer and Lemeshow, 1989;
and Menard, 1995). It is a measure of the association
between the explanatory variable and the odds of the
individual obtaining a binary outcome.

An odds ratio of greater than one shows that the odds
of an event occurring increases when the value of the
explanatory variable increases (Menard, 1995). It
demonstrates the existence of a positive relationship
between the explanatory variable and the odds of the
event occurring. The odds ratio is simplified to be an
exponential expression ($e^{\beta}$). One can use the odds of $e^{\beta_0 + \beta X}$,
where $\beta > 0$, to illustrate the positive relationship and
interpret the meaning of the odds ratio. For the odds of
$e^{\beta_0 + \beta}$ when X=1 versus the odds of $e^{\beta_0}$ when X=0, the
resulting odds ratio is $e^{\beta}$, which is the ratio of $e^{\beta_0 + \beta}$ and $e^{\beta_0}$.
Again, for the odds of $e^{\beta_0 + 2\beta}$ when X=2 versus the odds of
$e^{\beta_0 + \beta}$ when X=1, the resulting odds ratio is still $e^{\beta}$, which
is the ratio of $e^{\beta_0 + 2\beta}$ and $e^{\beta_0 + \beta}$. This ratio
implies that a one-unit increase in the
explanatory variable leads to an increase in the odds of
obtaining a binary outcome by a factor of $e^{\beta}$. Also, using
a tutorial program and student success as an example, if
the odds ratio equals 2, it can be interpreted that a
one-month increase in the
implementation of a tutorial program contributes to an
increase in the odds of a student's success by a factor of
2.

However, an odds ratio of less than one indicates that
the odds of an event occurring decreases when the value
of the explanatory variable increases (Menard, 1995). In
this case, the odds ratio is simplified to be an exponential
expression ($e^{-\beta}$). One can use the odds of $e^{\beta_0 - \beta X}$, where
$\beta > 0$, to illustrate the negative relationship and interpret
the meaning of the odds ratio. When the explanatory
variable (X) increases by one unit (from X=0 to X=1), the
odds change by a factor of $e^{-\beta}$, which is the ratio
of $e^{\beta_0 - \beta}$ and $e^{\beta_0}$. Moreover, when the explanatory variable
(X) increases by one unit (from X=1 to X=2), the odds
again change by a factor of $e^{-\beta}$, which is the ratio of
$e^{\beta_0 - 2\beta}$ and $e^{\beta_0 - \beta}$. It is difficult to convey the concept of odds
change for a fraction between zero and one. Therefore, a
better way of interpreting a negative regression coefficient
is to say that a one-unit increase in the explanatory
variable leads to a decrease in the odds of obtaining a
binary outcome by a factor of $e^{\beta}$, which is the inverse of
$e^{-\beta}$ ($0 < e^{-\beta} < 1$). Using anxiety score and student success
as an example, if the odds ratio equals 1/2, it can be
interpreted that a one-point increase in
anxiety score leads to a decrease in the odds of a
student's success by a factor of 2, which is the reciprocal
of 1/2.
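
The constancy of the odds ratio across one-unit steps can be verified numerically. In this sketch the intercept and slope values are hypothetical, chosen so that the odds ratio is 2 in the positive case and 1/2 in the negative case, mirroring the tutorial and anxiety examples.

```python
import math


def odds_at(x, b0, b):
    # Odds implied by the simple logistic model: e^(b0 + b*x).
    return math.exp(b0 + b * x)


b0 = -1.0                # hypothetical intercept
b_pos = math.log(2.0)    # positive coefficient, odds ratio e^b = 2
b_neg = -math.log(2.0)   # negative coefficient, odds ratio e^b = 1/2

# Odds ratio for X = 1 versus X = 0 and for X = 2 versus X = 1.
or_pos_01 = odds_at(1, b0, b_pos) / odds_at(0, b0, b_pos)
or_pos_12 = odds_at(2, b0, b_pos) / odds_at(1, b0, b_pos)
or_neg_01 = odds_at(1, b0, b_neg) / odds_at(0, b0, b_neg)
or_neg_12 = odds_at(2, b0, b_neg) / odds_at(1, b0, b_neg)

print(or_pos_01, or_pos_12)          # both about 2
print(or_neg_01, or_neg_12)          # both about 0.5
print(1.0 / or_neg_01)               # reciprocal, about 2
```

The intercept cancels in every ratio, so the odds ratio depends only on the coefficient, and taking the reciprocal of a ratio below one gives the factor by which the odds decrease.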

Although odds and odds ratio are commonly used for
interpreting logistic regression results, the delta-p statistic
can be used to measure the effect of a specific explanatory
variable on the probability of obtaining the binary outcome
(Peng et al., 2002). Using financial aid and student success
as an example, a delta-p of .10 for a student receiving
financial support is interpreted as increasing the probability
of a student's success by 10% as compared to a student
who does not receive financial support. The delta-p is
transformed from the regression coefficient using a method
originally recommended by a researcher, Peterson, in
1984 (Peng et al., 2002). The odds ratio can be found in
SPSS printouts when logistic regression analysis is
performed; however, the delta-p statistic cannot be
generated by the SPSS PC Version 12.0 commands.
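
Because the delta-p is not printed by SPSS, it can be computed by hand. The sketch below uses a commonly cited form of the transformation: shift the baseline log odds by the coefficient, convert back to a probability, and subtract the baseline probability. This form, the choice of baseline probability, and the function name are assumptions that should be checked against Peng et al. (2002) before use.

```python
import math


def delta_p(beta, baseline_p):
    # Baseline log odds, for example from the sample proportion of successes.
    l0 = math.log(baseline_p / (1.0 - baseline_p))
    # Log odds after a one-unit increase in the explanatory variable.
    l1 = l0 + beta
    # Convert back to a probability and take the difference.
    p1 = 1.0 / (1.0 + math.exp(-l1))
    return p1 - baseline_p


# Hypothetical example: odds ratio of 2 (beta = ln 2) at a baseline of 0.5.
change = delta_p(math.log(2.0), 0.5)
print(round(change, 4))
```

Unlike the odds ratio, the delta-p depends on the baseline probability: the same coefficient produces a smaller change in probability when the baseline is already close to zero or one.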

Example of Logistic Regression Analysis 

A logistic regression model was constructed for a
sample of 200 matriculated students randomly selected
from the population in 1993-1997. The study attempted to
answer the research question, "How well can the USMLE
Step 1 pass status be predicted by the explanatory
variables: demographics, admission test scores,
undergraduate GPA, medical school cumulative GPA,
course grades, and financial aid support?" Thus, the
research objectives of this study were: (a) to identify
explanatory variables that significantly contribute to the
probability of passing the USMLE Step 1; and (b) to
provide insight into the measure (odds or odds ratio) of the
effects of the explanatory variables.

In this study, the outcome or response variable was
binary, i.e. USMLE Step 1 pass status (labeled
as p_f_grp for the SPSS command), coded 1 for the pass
group and 0 for the fail group. The probability of a student
passing USMLE Step 1 was a continuous scale between
zero and one. The explanatory variables were a combination
of the discrete measure (gender, race, and medical school
curriculum track) and the continuous measure
(undergraduate basic sciences average, undergraduate
GPA, medical college admission test scores (MCAT
physical sciences, MCAT biological sciences, and MCAT verbal
reasoning), medical school freshman GPA, number
of sophomore courses failed, and financial aid loan
amount). The coding scheme for the discrete measurement
was gender (1 for male and 0 for female), ethnicity (1 for
African American and 0 for Non-African American),
historically black colleges and universities status (1 for
HBCU graduate and 0 for Non-HBCU graduate), and medical
school curriculum track (labeled as curr_grp for the SPSS
command, 1 for four-year curriculum track and 0 for
five-year curriculum track).
The logistic regression model was constructed using
the forward selection procedure. At each step, the 
explanatory variable with the smallest significance level 
for the Wald statistic was entered into the model. The 
default entry criterion for the explanatory variables was a 
p value of .05. The Wald statistics for all variables in the 
model were examined and the explanatory variable with 
the largest p value for the Wald statistic was removed 
from the model. The default removal criterion was p=.10. 
If no explanatory variables met the removal criterion, the 
next eligible variable was entered into the model. The 
iteration process for selecting explanatory variables 
continued until no additional variables met the entry or 
removal criterion. 

SPSS PC Commands for Logistic
Regression Analysis

The eight steps of the SPSS PC Version 12.0 
commands required to produce logistic regression analysis 
are as follows: 

Step 1 - Click Analyze, click Regression, and click
Binary Logistic;
Step 2 - Click on the dependent variable (p_f_grp), and click
the <right arrow> sign to move it to the dependent box;
Step 3 - Hold down the CTRL key, click all independent
variables (ung_bsa, ung_gpa, mcat_vr, mcat_ps, mcat_bs,
curr_grp, gender, ethnic, hbcu, course2f, fresh_gp, and
loan_amt), and click the <right arrow> sign to move them to
the covariates box;
Step 4 - Click the <down arrow> sign to display the method
options and select Forward: Wald;
Step 5 - Click the Categorical button;
Step 6 - Hold down the CTRL key, click on the categorical
variables (curr_grp, gender, ethnic, and hbcu), click the
<right arrow> sign to move them to the categorical
covariates box, and click Continue;
Step 7 - Click the Options button, select Classification
plots, click Display At last step, and click Continue; and
Step 8 - Click OK.

Note that clicking Paste instead of OK in Step 8 generates
the LOGISTIC REGRESSION syntax command lines as follows:

LOGISTIC REGRESSION p_f_grp
  /METHOD = FSTEP(WALD) ung_bsa ung_gpa mcat_vr mcat_ps
    mcat_bs curr_grp gender ethnic hbcu course2f fresh_gp
    loan_amt
  /CONTRAST (curr_grp)=Indicator
  /CONTRAST (gender)=Indicator
  /CONTRAST (ethnic)=Indicator
  /CONTRAST (hbcu)=Indicator
  /CLASSPLOT
  /PRINT = SUMMARY
  /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

Major Findings for Logistic Regression Analysis

The sample size of 200 in this study was appropriate 
because the ratio (1 to 50) of the number of the explanatory 
variables (4) to the sample size (200) exceeded the 
minimum ratio of 1 to 10 recommended for the logistic 
regression study (Peng et al., 2002). It is crucial to 
assess the model collinearity, model assumption, model 
fitting, and model accuracy prior to interpretation of the
research findings. Collinearity exists if one explanatory
variable is a function of the other explanatory variables.
There was no evidence of model collinearity because the
tolerances ($TOL = 1 - R_j^2$, where $R_j$ is the multiple correlation
coefficient between one explanatory variable ($X_j$) and the
other explanatory variables in the equation) were large
(a range of .571 to .971) based on linear regression
printouts (Norusis, 1985; and Menard, 1995). Moreover,
the assumption of the logistic regression model was 
satisfied because the histogram of residuals appeared 
normally distributed with a mean of zero, and the residuals 
on the scatter diagram also appeared to be parallel with 
the X-axis, the indication of a constant variance. 

As shown in the footnote of Table 1, the logistic 
regression model containing the four explanatory variables 
fits the data well based on the model chi-square test
($\chi^2$ = 83.40, df = 4, and p < .001) and the small value of the
-2 log likelihood (159.23). Also, the improvement chi-square
value ($\chi^2$ = 38.01, df = 4, and p < .001) indicated that the
model fits the data better than it had initially. The
model-fitting statistic, namely the pseudo R-square, measured
the success of the model in explaining the variation in
the data. The pseudo R-square for Nagelkerke (0.49) was
significantly different from zero. This indicated that
forty-nine percent of the variation in the binary outcome
variable was accounted for by the explanatory variables. Finally,
the prediction accuracies of 92% for the pass group and
84% for the combined pass and fail group were high.
However, the prediction accuracy of 63% for the fail group 
was not high. 

As illustrated in Table 1 , the regression coefficients for 
the MCAT physical sciences score, MCAT biological 
sciences score, number of sophomore courses failed, and 
medical school freshman GPA were significantly different 
from zero at the .001 significance level using the Wald 
test. It was evident that these four explanatory variables 
significantly affected the USMLE Step 1 pass status. 
However, the research findings indicated that the 
undergraduate basic sciences average, undergraduate 
GPA, MCAT verbal reasoning score, gender, ethnicity, 
HBCU status, medical school curriculum track, and financial
aid support were not significantly associated with the 
licensure examination performances. 

The logistic regression method yielded the following
logistic regression equation to predict the USMLE Step 1
pass status: the estimated probability (Passing USMLE
Step 1) = $P(X) = e^{Z} / (1 + e^{Z})$, where e is the base of the
natural logarithm, approximately 2.718; and

$$Z = -12.21 + 0.62\,(\text{MCAT Physical Sciences Score}) + 0.51\,(\text{MCAT Biological Sciences Score}) - 2.53\,(\text{Number of Sophomore Courses Failed}) + 1.89\,(\text{Medical School Freshman GPA}).$$

Based on the contribution from each of the explanatory 
variables, the estimated probability can be obtained from 
this equation for a particular student. It can be said that 
if the estimated probability is greater than or equal to 0.5, 
Table 1

Logistic Regression Model for Predicting
USMLE Step 1 Pass Status

Variables in                      Logistic Regression      Standard Error         Odds Ratio
the Equation                      Coefficient ($\beta$)    SE($\beta$)            ($e^{\beta}$)

MCAT Physical Sciences Score        0.62***                0.209                  1.87
MCAT Biological Sciences Score      0.51***                0.144                  1.66
Number of Sophomore Courses Failed -2.53***                0.639                  0.08
Medical School Freshman GPA         1.89***                0.533                  6.59
Constant                          -12.21***                2.317                  0.00

*** p < .001 based on the Wald test [$\chi^2 = (\beta / SE(\beta))^2$ with df = 1]

N = 200
-2 Log Likelihood: -2LL = 159.23
Model Chi Square Test: $\chi^2$ = 83.40, df = 4, and p < .001
Improvement Chi Square Test: $\chi^2$ = 38.01, df = 4, and p < .001
Pseudo R-square: Nagelkerke R-square = 0.49
Prediction Accuracy: Pass Group (92%), Fail Group (63%), and Combined Group (84%)



a student advances to the pass group of the USMLE Step
1. However, if the estimated probability is less than 0.5, a
student falls into the fail group of the USMLE Step 1.
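
The classification rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coefficients are the rounded values reported in Table 1, the dictionary keys are descriptive labels chosen here (not SPSS variable names), and the two example students are hypothetical, not drawn from the study data.

```python
import math

# Rounded coefficients as reported in Table 1.
INTERCEPT = -12.21
COEFFICIENTS = {
    "mcat_physical": 0.62,
    "mcat_biological": 0.51,
    "sophomore_courses_failed": -2.53,
    "freshman_gpa": 1.89,
}

def pass_probability(student):
    """Estimated P(passing USMLE Step 1) = e^Z / (1 + e^Z)."""
    z = INTERCEPT + sum(COEFFICIENTS[k] * student[k] for k in COEFFICIENTS)
    return math.exp(z) / (1.0 + math.exp(z))

def classify(student, cutoff=0.5):
    """Assign the pass group when the estimated probability reaches the cutoff."""
    return "pass" if pass_probability(student) >= cutoff else "fail"

# Hypothetical students used only to exercise the formula.
student_a = {"mcat_physical": 8, "mcat_biological": 8,
             "sophomore_courses_failed": 0, "freshman_gpa": 3.0}
student_b = dict(student_a, sophomore_courses_failed=2)

for label, s in (("A", student_a), ("B", student_b)):
    print(label, round(pass_probability(s), 3), classify(s))
```

For student A, Z works out to about 2.50, so the estimated probability is well above the 0.5 cutoff; two failed sophomore courses lower Z by 5.06 and move student B into the fail group.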

The effect (odds ratio) of each explanatory variable on
the pass status of the USMLE Step 1 is shown in the last
column of Table 1. When the medical school
freshman GPA increased by one point, the odds of a
student passing USMLE Step 1 increased by a factor of
6.59. This value indicates that the effect of the medical
school freshman GPA on the USMLE Step 1 performance
was very high. A change in the MCAT scores was also
associated with a change in the USMLE Step 1 performance.
If the MCAT physical sciences score increased
by one point, the odds of a student passing USMLE Step
1 increased by 87%. Meanwhile, when the
MCAT biological sciences score increased by one point,
the odds of a student passing USMLE Step 1 increased
by 66%. Furthermore, the odds of a student passing
USMLE Step 1 were multiplied by a factor of 0.08 when the
number of courses failed in sophomore year increased by
one. In other words, the odds of a student passing USMLE
Step 1 decreased by a factor of 12.5 (the inverse of 0.08)
for each additional sophomore course failed in the
basic science disciplines. This implied that the effect of
sophomore course performance on the USMLE Step 1
pass rate was extremely high. A noteworthy finding is that
the odds of a student passing the licensure examination
depended more heavily on sophomore course performance
and medical school freshman GPA than on
the pre-admission variables (MCAT physical sciences
and biological sciences scores).
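
The odds-ratio arithmetic in this paragraph can be reproduced as a rough sketch. It assumes the rounded coefficients printed in Table 1, so the results may differ in the second decimal place from the odds ratios in the table, which were presumably computed from unrounded estimates; the function and label names are illustrative only.

```python
import math

def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor: e^beta."""
    return math.exp(beta)

def percent_change_in_odds(beta):
    """Percentage change in the odds for a one-unit increase."""
    return (odds_ratio(beta) - 1.0) * 100.0

# Rounded coefficients as reported in Table 1.
table_1 = {
    "MCAT physical sciences": 0.62,
    "MCAT biological sciences": 0.51,
    "Sophomore courses failed": -2.53,
    "Freshman GPA": 1.89,
}

for name, beta in table_1.items():
    ratio = odds_ratio(beta)
    print(f"{name}: OR = {ratio:.2f}, change in odds = {percent_change_in_odds(beta):+.0f}%")

# An odds ratio below one is often easier to read as its inverse.
print("Inverse OR for courses failed:", round(1.0 / odds_ratio(-2.53), 1))
```

Reading an odds ratio of 0.08 as "the odds shrink by a factor of about 12.5" is the same inversion the paragraph above performs by hand.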

In summary, the medical school freshman GPA and 
sophomore course performance were significant explanatory 
variables for the USMLE Step 1 performance. The two 
MCAT scores (physical and biological sciences) were 
also significant explanatory variables, regardless of other 
pre-admission variables such as gender, ethnicity, and 
undergraduate BSA and GPA scores. As expected, the 



medical school freshman GPA was strongly correlated 
with the Step 1 performance while the number of courses 
failed during sophomore year was negatively associated 
with the Step 1 performance. It is evident that basic 
science disciplines have a predictive power for the medical 
licensure examination. This may imply that the medical 
school has implemented an adequate basic sciences 
curriculum, course instructions, and student assessment 
compared to other medical schools in the nation. 

Survival Analysis 

The Cox regression model can be used to study the 
risk factors that are significantly associated with the 
timing of the critical event. The timeline known as survival 
time refers to the length of the time interval (month, 
semester, or year) between the onset time of the study 
and the timing of the event occurrence (Steinberg, 1999).
The critical events cover not only negative and unpleasant 
experiences, such as dropout and academic difficulty, but 
also positive and pleasant results, such as passing certain 
examinations and graduation. 

Survival analysis is a unique statistical approach for 
analyzing uncensored and censored data in a single study 
(Singer and Willett, 1991). Uncensored data are a
complete set of timelines for which the event occurred,
as opposed to censored cases for which the event of
interest does not occur during the study period
(Kleinbaum, 1996; and Steinberg, 1999).
For instance, if the graduation status
within a two-year master’s program is the event of interest, 
the timely graduation in two years can be considered as 
the uncensored data. Similarly, if the academic difficulty 
(dismissal, leave of absence, and withdrawal) from a 
seven-year doctorate program is the critical event of interest, 
the time-to-event for students A, F, and G illustrated in 
Table 2 belongs to the uncensored data. The censored 
data provide only partial information of timelines, which are 
collected under these circumstances: (1) students do not
experience the event of interest (academic difficulty) before 
the study ends (student B - graduated, student D - still 
enrolled); (2) student withdraws from the study (student C 
- deceased); and (3) student is lost to follow-up during the 
study period (student E - transferred out). Because the 
occurrence of time-to-event for students (students B, C, 
D, and E) is not evident and hidden from view, the term 
‘censored’ is applied to it. In essence, censoring occurs 
when investigators have some partial information about 
the time-to-event of the individual students, but they do not 
know the exact time variable. The survival time is measured 
in years from the time students enter the study. Students 
C and F entered the study at year two and one, respectively, 
which are different from the rest of the group who entered 
at the beginning of the study. 

To understand survival analysis, one needs to begin 
with the survival function S(t). As indicated by researchers 
(Braun and Zwick, 1993; Elandt-Johnson and Johnson,
1980; and Lee, 1992), the survival function can be 
expressed as S(t) = 1 - F(t) = P(T>t), where T represents 
a specific time of the event occurrence. In other words, 
the survival function cumulates the proportion of individuals 
surviving longer than the time variable, which is a 
complement of the cumulative distribution function. The 
survival function gradually decreases when more individuals 
experience the critical event. It depicts theoretically the 

Table 2

Example of Uncensored and Censored
Data for the Doctorate Program

The Outcome Variable is Academic Difficulty: Dismissed, Withdrew, and Leave of Absence because of
Academic Reasons

Time Variable: Years (0 to 7)

Student    End of Follow-up                Status Variable*

A          X (dismissed)                   1
B          graduated                       0
C          deceased                        0
D          study end (still enrolled)      0
E          transferred out                 0
F          X (leave of absence)            1
G          X (withdrew)                    1

* Status variable is coded as 1 for event occurring (uncensored data) and coded 0 for event not
occurring (censored data).



survival experience of individuals from time zero to time
infinity (Hosmer and Lemeshow, 1999; Kleinbaum, 1996;
and Lee, 1992). The survival function has a probability
value of one when the time variable equals zero and a
probability value close to zero when the time variable
approaches infinity. However, in practice, the estimated
survival function is a step function rather than a smooth
curve: it drops by a step at each time interval to
describe the survival experience during the study period.

If the survival function represents a positive or pleasant
experience (e.g., surviving dropout and academic
difficulty), the hazard function h(t) represents
a negative or unhappy event (e.g., experiencing dropout
and academic difficulty). The hazard function is a measure
of the tendency of an individual to experience the critical
event. Unlike the survival function, the hazard curve does
not range from one to zero. Instead, it starts anywhere
and goes up and down in any direction over time. The
hazard function is designed to provide insight into the
conditional failure rate (Kleinbaum, 1996). It explicitly
refers to the instantaneous potential per unit time for the
critical event to occur, given that the individual has survived to
a particular time (Hosmer and Lemeshow, 1999;
Kleinbaum, 1996; and Lee, 1992). Investigators who have
been exposed to calculus and mathematical statistics 
may desire to know the relationship of hazard and survival 
functions. 



Relationship between Hazard and 
Survival Functions 

The hazard function may be written as the conditional
failure rate h(t), where

$$h(t) = \lim_{\Delta t \to 0} \frac{P[t < T < (t+\Delta t) \mid T > t]}{\Delta t}
= \lim_{\Delta t \to 0} \frac{P[t < T < (t+\Delta t) \text{ and } T > t]}{\Delta t \, P(T > t)}$$

The definition of conditional probability, P(A|B) = P(A and
B) / P(B) (Meyer, 1970), is applied to the hazard function
above. That is, $P(A|B) = P[t < T < (t+\Delta t) \mid T > t]$;
$P(A \text{ and } B) = P[t < T < (t+\Delta t) \text{ and } T > t]$;
and $P(B) = P(T > t)$, where
A represents $[t < T < (t+\Delta t)]$, a specific survival time (T)
of the event occurrence in a narrow time interval between
t and $t + \Delta t$, and B refers to $(T > t)$, a specific survival time
(T) of the event occurrence greater than the time variable
(t).

$$h(t) = \lim_{\Delta t \to 0} \frac{P[t < T < (t+\Delta t) \text{ and } T > t]}{\Delta t \, P(T > t)}
= \lim_{\Delta t \to 0} \frac{P[t < T < (t+\Delta t)]}{\Delta t \, P(T > t)}$$

The method of combining sets (that is, events) to obtain
a new set is applied to the numerators of the hazard
function above. Given that A and B are two events, $A \cap B$ is
the event which occurs if and only if both A and B occur (Meyer,
1970). Therefore, the joint event of A and B, $A \cap B$, or
$[t < T < (t+\Delta t) \text{ and } T > t]$, equals the overlap
(intersection) of the event A, i.e., $[t < T < (t+\Delta t)]$, and
the event B, i.e., $(T > t)$, which is $[t < T < (t+\Delta t)]$ itself.

$$h(t) = \lim_{\Delta t \to 0} \frac{P[t < T < (t+\Delta t)]}{\Delta t \, P(T > t)} = \frac{f(t)}{S(t)}$$
At — > 0 

The hazard function h(t) above becomes a ratio of
two functions: the probability density function f(t) and
the survival function S(t). The equality of the formula is
based on two mathematical relations: (1) probability density
function $f(t) = \lim_{\Delta t \to 0} P[t < T < (t+\Delta t)] / \Delta t$ (Lee,
1992); and (2) $S(t) = P(T > t)$ (Braun and Zwick, 1993;
Elandt-Johnson and Johnson, 1980; and Lee, 1992).

$$h(t) = \frac{f(t)}{S(t)} = \frac{d[1 - S(t)]/dt}{S(t)} = -\frac{d[S(t)]/dt}{S(t)}$$

The equality of the hazard function above is derived
from the substitutions of two mathematical expressions:
(1) $f(t) = d[F(t)]/dt$; and (2) $F(t) = 1 - S(t)$ into the equation
$h(t) = f(t) / S(t)$ (Braun and Zwick, 1993; Elandt-Johnson
and Johnson, 1980; and Lee, 1992). Because the derivative
(d/dt) with respect to the time variable is applied to F(t),
the hazard function can be reiterated as the change
(slope) of an individual experiencing the hazard per unit
time given that the individual has survived longer than time (t).

There is a clearly defined relationship between the
hazard and survival functions. Investigators can derive the
hazard function from information regarding the survival
function. As indicated by researchers,
$h(t) = -[dS(t)/dt] / S(t) = -d[\log_e S(t)]/dt$ (Braun and Zwick, 1993;
Elandt-Johnson and Johnson, 1980; and Lee, 1992). Note that
the cumulative hazard function $H(t) = -\log_e S(t)$, or
$H(t) = -\ln S(t)$, because the integral (or sum) is applied to both
sides of this equation: $h(t) = -d[\log_e S(t)]/dt$. Investigators
can also derive the survival function from information
regarding the hazard function. Because
$h(t) = -[dS(t)/dt] / S(t)$, the mathematical expression of the survival
function can be written as $S(t) = e^{-\int_0^t h(u)\,du}$, where the integral
($\int$) is denoted by the area under the curve between u = zero
and u = t (Elandt-Johnson and Johnson, 1980; Kleinbaum,
1996; and Lee, 1992). Therefore, it is clear that the survival
function S(t) decreases as the hazard function h(t)
increases, and vice versa.
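
As a numerical illustration of the relation S(t) = exp(-H(t)), where H(t) is the area under the hazard curve from 0 to t, the following minimal sketch integrates a hazard function with the trapezoidal rule. The two hazard functions (one constant, one rising linearly) and their rates are hypothetical choices made only for this example.

```python
import math

def cumulative_hazard(hazard, t, steps=10_000):
    """Trapezoidal approximation of H(t), the integral of h(u) from 0 to t."""
    width = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for i in range(1, steps):
        total += hazard(i * width)
    return total * width

def survival(hazard, t):
    """S(t) = exp(-H(t))."""
    return math.exp(-cumulative_hazard(hazard, t))

def constant_hazard(u):
    return 0.10          # hypothetical: 0.10 events per unit time

def rising_hazard(u):
    return 0.05 * u      # hypothetical: hazard grows linearly with time

for t in (0.0, 1.0, 5.0):
    print(t, round(survival(constant_hazard, t), 4), round(survival(rising_hazard, t), 4))
```

With a constant hazard the survival curve is the familiar exponential decay; with the rising hazard the curve starts flatter and then falls faster, which is the sense in which a larger hazard drives the survival function down.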

Kaplan-Meier Survival Analysis 

Before proceeding to the Cox regression model, it is
imperative to describe several terms that are frequently
used in survival analysis: Kaplan-Meier estimator, survival
function, conditional probability, and log-rank test. The
Kaplan-Meier estimator allows investigators: (1) to use the
definition of conditional probability to derive survival functions
for distinct groups; and (2) to determine the program
effectiveness by comparing the survival function among
groups, respectively.

Because students’ survival for the subsequent time
interval depends on students’ survival from the previous
time interval, the Kaplan-Meier formula is merely the
application of the conditional probability (Belle et al.,
2004; and Kleinbaum, 1996), which can be expressed as
P(A and B) = P(B) P(A|B) as illustrated in Table 3. The
formula can be interpreted as follows: the probability of the
occurrence of the joint events A and B equals the probability
of the event B multiplied by the probability of the event A,
given the occurrence of event B. Event A (After) refers to
the occurrence of the student surviving for the subsequent
time interval. Event B (Before) represents the occurrence
of the student surviving for the previous time interval.

Applying the visual examination method to the survival
curves, investigators can detect the difference between
two survival functions. For example, if two survival functions
are separated by a large gap in the first half of the study
period but thereafter are somewhat closer to each other,
this suggests that the intervention strategy
is more effective earlier during the study period. In
Kaplan-Meier survival analysis, the log-rank test allows investigators
to test the significant differences of survival functions at
different follow-up times among the study groups
(Kleinbaum, 1996). By comparing the survival curves for
the experimental group (with the intervention strategy) and
the control group (without the intervention strategy),
investigators can determine the effectiveness of the
intervention strategy if the experimental group appears to
be superior.



Table 3

Example of Calculating the Estimated Survival
Function S(t) for the Doctorate Program*

The Outcome (or Event) Variable of Interest is Academic Difficulty: Dismissed, Withdrew, and Leave
of Absence because of Academic Reasons

Time         Risk Set***                                  Number      Number        Estimated S(t)**
(in years)   $R(t_j)$                                     of Events   of Censored   or % of Surviving
$t_j$

0            7 students' survival time $\geq$ 0 years     0           0             1
2            7 students' survival time $\geq$ 2 years     2           1             1 x 5/7 = .7143
5            4 students' survival time $\geq$ 5 years     1           3             .7143 x 3/4 = .5357

* Survival times (in years): 2+, 2+, 3, 5+, 5, 6, and 7 come from Table 2, where + stands for the
occurrence of academic difficulty
** Kaplan-Meier formula: P(B) P(A|B) = P(A and B), e.g., S(t=5) = (.7143)(3/4) = .5357
*** Each student in $R(t_j)$ has a survival time $\geq t_j$
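
The product-limit calculation in Table 3 can be checked with a short sketch. It assumes the seven survival times are read as events at years 2, 2, and 5 and censored observations at years 3, 5, 6, and 7, which matches the event and censoring counts shown in the table; the function name and output layout are illustrative, not part of the original analysis.

```python
def kaplan_meier(times, events):
    """Return (time, number at risk, number of events, S(t)) per distinct event time.

    times  -- survival time of each student
    events -- 1 if the event occurred at that time, 0 if the time is censored
    """
    survival = 1.0
    rows = []
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for time in times if time >= t)
        n_events = sum(1 for time, e in zip(times, events) if time == t and e == 1)
        # Conditional probability of surviving this time, given survival so far.
        survival *= (at_risk - n_events) / at_risk
        rows.append((t, at_risk, n_events, survival))
    return rows

# Assumed reading of the Table 3 survival times (1 = academic difficulty).
times = [2, 2, 3, 5, 5, 6, 7]
events = [1, 1, 0, 1, 0, 0, 0]

for t, at_risk, n_events, s in kaplan_meier(times, events):
    print(f"t={t}: at risk={at_risk}, events={n_events}, S(t)={s:.4f}")
```

Each step multiplies the running survival estimate by the conditional probability of surviving the current event time, which is the P(B) P(A|B) product described in the text.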



Cox Regression Equation 

Unlike the Kaplan-Meier estimator, the Cox regression
method allows investigators to generate the hazard function
as a function of the time variable, risk factors, and baseline
hazard. Investigators can calculate the effect measure (relative
risk or relative hazard) of each risk factor and interpret the
hazard ratio. The hazard ratio is the ratio of two hazard
functions, which allows investigators to measure the
association between the risk factors and the effects of
such factors on the risk functions.

The Cox regression model can be expressed as
$h(t, X) = h_0(t)\, e^{\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}$
(Hosmer and Lemeshow, 1999;
Kleinbaum, 1996; and Lee, 1992). The hazard function,
h(t, X), is a function of the baseline hazard $h_0(t)$ and the
risk factors (X). The baseline hazard function is similar to
the constant of the linear regression model, which
represents the value of the hazard function before the risk
factors are taken into account. The base of the natural
logarithm is denoted by e, which equals approximately 2.718.
The risk factors may include continuous and categorical
variables. The continuous variables, e.g., grade-point
averages and test scores, are measured on an interval or
ratio scale. The categorical variables, e.g., gender and
course grades, are measured on a nominal and ordinal
scale, respectively. The regression coefficients ($\beta$) are
unknown parameters to be estimated by the partial
maximum likelihood estimation approach (Hosmer and
Lemeshow, 1989; and Kleinbaum, 1996). The term ‘partial’
likelihood is applied because the likelihood equation
calculates probabilities only for cases of the event
occurrence rather than all cases. For a specific value of
the time variable, the hazard function depends on the
quantities of $\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$ and exhibits graphically
the variation in the time-to-event occurrence (Kleinbaum,
1996). Because the hazard function is a non-linear curve,
it is necessary to use the iterative process to find better
approximations of the regression coefficients that satisfy
the partial likelihood equation.

To assess the model goodness of fit, the Cox regression
model allows investigators to address the research
question, "Does the model containing the variables in
question tell us more about the outcome variable than a
model that does not include these variables?” (Willett and
Singer, 1991). In particular, it allows investigators to perform
the likelihood ratio test, which compares the likelihood for
the intercept only model to the likelihood for the model
containing the risk factors within each model. The logic of
hypothesis testing for the likelihood ratio test is the same
for Cox regression and logistic regression models. Based
on the significance of the likelihood ratio test, investigators
can claim that at least one of the risk factors contributed
to the relative hazard. In addition, Cox regression analysis
uses the Wald test to examine whether or not each
individual regression coefficient is significantly different
from zero. The logic of hypothesis testing for the Wald
test is also similar for both the Cox and logistic regression
models. If an individual regression coefficient is significantly
different from zero, the corresponding risk factor significantly
contributes to the hazard function of an event occurring.

Cox regression analysis is known as the proportional 
hazards model because the model assumes that two 
hazard functions are proportional to each other over time 
(Kleinbaum, 1996). For example, given two students, if
one initially has twice the relative risk of the
second student, then the relative risk for the first student
is two times that of the second student at all time points.
Also, given two student groups, if one group initially has
three times the relative risk of the second
group, the relative risk of the first group is three times
that of the second group throughout the study
period. In other words, the model assumes that the relative
hazard is constant across time for two different students
and student groups, respectively.
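
The proportionality assumption can be illustrated with a small sketch of the Cox hazard, h(t, X) = h0(t) exp(beta * X). Everything numeric here is hypothetical: the baseline hazard is an arbitrary time-varying function and the single coefficient is set to ln 2 so that the first student carries twice the hazard of the second.

```python
import math

def cox_hazard(t, x, beta, baseline):
    """h(t, X) = h0(t) * exp(beta_1*x_1 + ... + beta_p*x_p)."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline(t) * math.exp(linear_predictor)

def baseline_hazard(t):
    # Hypothetical baseline that changes over time (t in months).
    return 0.01 + 0.002 * t

beta = [math.log(2.0)]    # hypothetical coefficient: hazard doubles per unit of X
student_1 = [1.0]
student_2 = [0.0]

ratios = []
for month in (1, 12, 24, 36):
    h1 = cox_hazard(month, student_1, beta, baseline_hazard)
    h2 = cox_hazard(month, student_2, beta, baseline_hazard)
    ratios.append(h1 / h2)
    print(month, round(h1, 4), round(h2, 4), round(h1 / h2, 4))
```

Both hazards rise with the baseline, yet their ratio stays at two in every month, which is what the proportional hazards assumption asserts.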

Interpretations of Relative Hazard 
and Hazard Ratio 

The exponential expression of each regression 
coefficient in the Cox regression model is called relative 
risk or relative hazard. The magnitude of the relative 
hazard indicates the direction of the association between 
the outcome variable and the corresponding risk factor. If 
the relative hazard is greater than one (i.e. positive 
regression coefficient), the hazard of a student experiencing 
the critical event increases as the value of the risk factor 
increases. This implies that the relative hazard is positively 
associated with the risk factor. Moreover, if the relative 
hazard becomes one (i.e. regression coefficient is zero), 
the risk factor has no effect. On the other hand, if the 
relative hazard is a positive fraction and less than one (i.e. 
negative regression coefficient), the relative hazard of a 



student experiencing the critical event decreases as the 
value of the risk factor increases. This implies that the 
relative hazard is negatively associated with the risk factor. 

Investigators should focus on the hazard ratio (HR)
when they study the effects of the risk factors on the
critical event in Cox regression analysis. The value of the
hazard ratio reflects the strength of the relationship between
a specific risk factor and the effect of that factor. The
hazard ratio is a ratio of two hazard functions, which can
be expressed as
$HR = h_M(t) / h_N(t) = [h_0(t)\, e^{\beta X}] / [h_0(t)\, e^{\beta X^*}] = e^{\beta (X - X^*)}$
(Kleinbaum, 1996). The hazard ratio
compares the relative hazards as the risk factor (X)
changes, i.e., the relative hazard of X = 0 compared to the
relative hazard of X = 1. Whether the hazard ratio is less
than or greater than one depends on whether the regression
coefficient is negative or positive. The hazard ratio can also
be interpreted as the
multiplicative change, rather than additive change, in the
hazard of a student experiencing the critical event for
every unit of change in a specific factor, holding
other factors constant. If the hazard ratio of the Mth
group (without the intervention strategy) versus the Nth
group (with the intervention strategy) equals three,
investigators may conclude that students in the Mth group
experience three times the hazard of those
in the Nth group. Moreover, if the risk factor (X) with a value
of four is for student M and the same risk factor with a
different value (X*) of three is for student N, the hazard ratio
equals $e^{\beta (X - X^*)} = e^{\beta (4 - 3)} = e^{\beta}$. Assuming the hazard ratio
is two ($e^{\beta} = 2$, where $\beta = \log_e 2$), the hazard of a student
experiencing academic difficulty is two times greater for
student M compared to student N. Note that the baseline
hazard function, $h_0(t)$, appears in both the numerator and
denominator terms of the hazard ratio and
cancels out if the proportional hazards assumption
holds true.
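
The student M versus student N example above can be written as a one-line formula, HR = exp(beta * (X - X*)). The sketch below is only an illustration; the function name is an assumption, and the coefficient ln 2 is the value the text assumes for a hazard ratio of two.

```python
import math

def hazard_ratio(beta, x, x_star):
    """HR = exp(beta * (x - x_star)); the baseline hazard cancels out."""
    return math.exp(beta * (x - x_star))

beta = math.log(2.0)   # chosen so that a one-unit difference doubles the hazard

# Student M has X = 4 and student N has X* = 3, as in the text.
print(hazard_ratio(beta, 4, 3))

# The change is multiplicative: a two-unit difference gives 2 x 2, not 2 + 2.
print(hazard_ratio(beta, 5, 3))

# Swapping the students inverts the ratio.
print(hazard_ratio(beta, 3, 4))
```

Because the effect is multiplicative, a two-unit difference squares the one-unit hazard ratio, and reversing the comparison gives its reciprocal.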

Example of Cox Regression Analysis 

A Cox regression model was used to analyze a sample 
of 200 matriculated students randomly selected from the 
population in 1993-1997. The study attempted to answer
the following research questions: “How well can the following 
risk factors explain students encountering academic 
difficulty: demographics, undergraduate GPAs, medical 
college admission test scores, medical school academic 
performances, medical school curriculum tracks, and 
financial aid support amounts?” and “In which months do 
students experience the highest risk of academic difficulty 
based on the risk factors mentioned above?” Thus, the 
aims of the study were: (a) to identify risk factors that are 
significantly associated with the hazard ratio of students 
experiencing academic difficulty; (b) to provide insight into 
the measure (relative risk, relative hazard) of the effects of 
the risk factors mentioned above; and (c) to detect at what 
month students are most likely to experience academic 
difficulty. 

The outcome variable of interest was a continuous
measure: the hazard function of students experiencing
academic difficulty. Academic difficulty refers to dismissal,
withdrawal, and leave of absence due to academic reasons.
The time variable was the survival time in months. The
time origin of this study was the matriculation year, which
was not the same calendar year for each matriculating
class. A combined data set for a five-year period was
adopted for this study based on the justification of class
comparability in academic difficulty and risk factors. The
study ended in the 36th month because no single student
experienced academic difficulty beyond the 36th month
within the study population. The status variable was labeled
as difficult (difficult for the SPSS command) and coded as
1 for students who experienced academic difficulty and 0
for students who did not experience academic difficulty
during the study period. The risk factors with a continuous
measure were undergraduate basic-sciences average
(BSA), undergraduate GPA, medical college admission
test scores (MCAT physical sciences, biological sciences,
and verbal reasoning scores), medical school freshman
GPA, number of sophomore courses failed, and the financial
aid loan amount. The risk factors with a discrete measure
were gender (1 for male and 0 for female), ethnicity (1 for
African American and 0 for Non-African American),
historical black colleges and universities status (1 for
HBCU graduate and 0 for Non-HBCU graduate), and medical
school curriculum track (labeled as curr_grp for the SPSS
command, 1 for four-year curriculum track and 0 for
five-year curriculum track).

This Cox regression model was constructed by means
of the forward selection procedure. At each step, the risk
factor with the smallest observed significance level of the
Wald statistic was entered into the model. The default p
value of .05 was the entry criterion for the risk factors.
Next, all risk factors in the model were examined to see
if they met the default removal criterion (p = .10). The Wald
statistics for all risk factors in the model were examined,
and the risk factor with the largest observed significance
level for the Wald statistic was removed from the model.
If there were no risk factors that met the removal criterion, the
next eligible risk factor was entered into the model. This
process continued until no risk factors met the entry or removal
criterion.

SPSS PC Commands for Cox Regression Analysis 

The 12 steps of the SPSS PC Version 12.0 commands
required to perform Cox regression analysis are as follows:
Step 1 - Click Analyze, click Survival, and click Cox
Regression; Step 2 - Click on the time variable (month), and
click the <right arrow> sign to move it to the time box; Step
3 - Click on the status variable (difficult) and click the <right
arrow> sign to move it to the status box; Step 4 - Click the
Define Event button, key in 1 in the single value box, and
click Continue; Step 5 - Click all risk factors (ung_bsa,
ung_gpa, mcat_vr, mcat_ps, mcat_bs, gender, ethnic,
hbcu, course2f, fresh_gp, and loan_amt) and click the <right
arrow> sign to move them to the covariates box; Step 6 - Click
the method options and select Forward-Wald; Step 7 -
Click on the stratifying variable (curr_grp) and click the <right
arrow> sign to move it to the strata box; Step 8 - Click the
category option, click the <right arrow> sign to move the
covariates (gender, ethnic, hbcu) to the categorical
covariates box, and click Continue; Step 9 - Click the Plots
button, select Hazard plots, and click Continue; Step 10
- Click the Option button, select display model information;
Step 11 - Select Display Baseline Function, and click
Continue; and Step 12 - Click OK. Alternatively, at Step 12,
click Paste to generate the COXREG syntax command lines
as follows:

COXREG month /STATUS=difficult(1)
/STRATA=curr_grp /CONTRAST(gender)=Indicator
/CONTRAST(ethnic)=Indicator /CONTRAST(hbcu)=Indicator
/METHOD=FSTEP(WALD) ung_bsa ung_gpa mcat_vr
mcat_ps mcat_bs gender ethnic hbcu course2f fresh_gp
loan_amt /PLOT HAZARD /PRINT=SUMMARY BASELINE
/CRITERIA=PIN(.05) POUT(.10) ITERATE(20).

Major Findings for Cox Regression Analysis 

Data were analyzed to examine the relationship between 
the risk factors and the hazard of a student experiencing 
academic difficulty. The curriculum track demonstrated its
significant contribution to the hazard function ($\beta$ = -1.33,
p < .001, and $e^{\beta}$ = 0.265), and the log-minus-log (LML) plot
of the survival functions appeared not to be parallel in the 
first run of Cox regression analysis. These findings support 
evidence of the violation of the proportional hazards 
assumption. Therefore, the stratified Cox regression model, 
stratifying on curriculum track, was implemented. No 
strong evidence of the violation of proportionality was 
found based on the parallel pattern of the LML plot of the 
survival curves. 

As illustrated in the footnote of Table 4, the model fits 
the data quite well based on the model chi-square test 
($\chi^2$ = 39.09, df = 3, and p < .001) and the small value of the
-2 log likelihood (280.24). Also, the improvement chi
square ($\chi^2$ = 44.60, df = 3, and p < .001) indicated that the
model fits the data better than it had initially. Note that the 
initial -2 log likelihood was 324.84 in which the model 
contained only the constant term. Because the three 
explanatory variables were added in the model, the -2 log 
likelihood became 280.24, a decrease (improvement) of 
44.60 units. 
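
The improvement chi-square quoted above is simply the drop in the -2 log likelihood, and its p value can be checked without statistical software. The sketch below uses the closed-form upper-tail probability of a chi-square distribution with three degrees of freedom; the function name is an assumption made for this example, and the formula applies only to df = 3.

```python
import math

def chi_square_sf_df3(x):
    """Upper-tail probability P(chi-square > x) for 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

initial_minus_2ll = 324.84   # model with no explanatory variables
final_minus_2ll = 280.24     # model with the three risk factors

improvement = initial_minus_2ll - final_minus_2ll
p_value = chi_square_sf_df3(improvement)

print("Improvement chi-square:", round(improvement, 2))
print("p < .001:", p_value < 0.001)
```

A quick sanity check on the formula is the familiar critical value 7.815, which should return a tail probability of about .05 for three degrees of freedom.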

In this study, the following eight risk factors were not 
significantly associated with the hazard of a student 
experiencing academic difficulty: undergraduate basic 
sciences average, undergraduate GPA, MCAT physical 
science, MCAT biological science, ethnicity, HBCU status,
medical school freshman GPA, and financial aid loan
amount. However, the risk factor, MCAT verbal reasoning
score, had the highest Wald statistic (32.5) and entered
the model equation in the first step. This was followed by
the inclusion of two more risk factors: gender and number
of sophomore courses failed. As shown in Table 4, the 
regression coefficients for MCAT verbal reasoning score 
and gender were significantly different from zero at the 
.001 significance level. Also, the regression coefficient for 
the number of courses failed during sophomore year was 
significantly different from zero at the .05 significance 
level. 

Table 4 

Cox Regression Model for Students 
Experiencing Academic Difficulty 

Variables in the Equation            Regression        Standard Error    Hazard Ratio
                                     Coefficient       of $\beta$,       ($e^{\beta}$)
                                     ($\beta$)         SE($\beta$)

MCAT Verbal Reasoning Score          -0.68***          0.119             0.51
Gender (1 for male; 0 for female)    -1.05***          0.377             0.35
Number of Sophomore Courses Failed    0.75*            0.309             2.11

*p < .05 and ***p < .001 based on the Wald test [$\chi^2 = (\beta / SE(\beta))^2$ with df = 1] 
N = 200 
-2 Log Likelihood: 280.24 
Model Chi Square Test: $\chi^2 = 39.09$, df = 3, and p < .001 
Improvement Chi Square Test: $\chi^2 = 44.60$, df = 3, and p < .001 

It was evident that these three risk factors (MCAT 
verbal reasoning score, gender, and number of courses 
failed in sophomore year) were significantly associated 
with academic difficulty. The hazard ratio was used to 
express the measure of the effects of these risk factors. 
For instance, the number of courses failed in the 
sophomore curriculum exhibited a hazard ratio of 2.11. 
This indicated that each additional course failed was 
associated with about twice the hazard of experiencing 
academic difficulty, holding the other risk factors constant. 
The hazard ratio for gender (0.35) indicated that male 
students had 0.35 times the hazard of experiencing 
academic difficulty compared to female students. Because 
a hazard ratio below one is easier to read through its 
inverse, it is more meaningful to say that female students 
had almost three times (1/0.35 ≈ 2.9) the hazard of male 
students. Similarly, the hazard ratio for MCAT verbal 
reasoning (0.51) indicated that a one-unit increase in the 
MCAT verbal reasoning score was associated with a 
decrease in the hazard of academic difficulty by a factor of 
about two (1/0.51 ≈ 2.0). 
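
The hazard ratios discussed above can be recomputed from the reported coefficients. This is a minimal sketch; the dictionary name `coefficients` and the helper `hazard_ratio` are introduced here for illustration. Because the published coefficients are rounded to two decimals, the recomputed ratios agree with the reported ones only up to rounding.

```python
import math

# Regression coefficients as reported for the stratified Cox model.
coefficients = {
    "MCAT verbal reasoning score": -0.68,
    "Gender (1 = male, 0 = female)": -1.05,
    "Number of sophomore courses failed": 0.75,
}

def hazard_ratio(beta):
    """Multiplicative change in the hazard for a one-unit increase."""
    return math.exp(beta)

for name, beta in coefficients.items():
    hr = hazard_ratio(beta)
    # The inverse is the easier reading when the ratio is below one.
    print(f"{name}: hazard ratio = {hr:.3f}, inverse = {1.0 / hr:.3f}")
```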

The hazard function h(t, X) against time (t) was plotted 
graphically based on the stratified Cox regression model. 
Figure 1 displays the hazard curves, showing that the curve 
for the four-year curriculum track is lower than that for the 
five-year curriculum track. This suggested that students in 
the four-year curriculum track were less likely to experience 
academic difficulty. According to the pattern of the hazard 
function, the hazard for students in the five-year curriculum 
track began to rise at the 20th month (first semester 
of the sophomore year), peaked at the 30th month (second 
semester of the sophomore year), and remained at the 
same level through the rest of the study period. 
The implication of this research finding was that the 
medical school should focus on academic support for 
students in the five-year curriculum track to address this 
extended period of academic difficulty. 

Figure 1 

Hazard Curves for Students 
Experiencing Academic Difficulty 

[Figure not reproduced; horizontal axis: Months] 

Similarities between Logistic and Cox 
Regression Models 

As illustrated in Table 5, logistic and Cox regression 
models share several common characteristics. The model 
similarities are mainly reflected in the principle of modeling 
strategies, the objective of regression models, the method 
of parameter estimations, and the procedure of variable 
selections. 

In the principle of modeling strategies, both models 
include all relevant explanatory variables at the initial 
stage of model fitting and aim for parsimony and 
consistency at the completion stage. With both methods, all 
relevant explanatory variables can be included in a single 
model, so that multiple effects can be studied 
simultaneously and the effect of an individual explanatory 
variable can be examined while the others are held 
constant. Parsimony means that when fewer explanatory 
variables are sufficient to explain the occurrence of the 
event, investigators do not need elaborate explanations or 
unnecessary variables in the model. Consistency means that 
the significant explanatory variables and their effects 
should remain as similar as possible when the models are 
reconstructed over time. 

Another characteristic that the two models have in 
common is the study objective. Both logistic and Cox 
regression models are applicable to the occurrence of the 
binary outcome or critical event. They are designed to 
study the relationship between certain student learning 
outcomes and their relevant explanatory variables. The 
logistic regression model allows investigators to identify 
explanatory variables that significantly contribute to the 
probability of a student obtaining a binary outcome while 
the Cox regression model allows investigators to identify 
risk factors that significantly contribute to the hazard 
function of a student experiencing the critical event. 

Both methods use the maximum likelihood estimation 
technique to estimate the regression coefficients and to 
construct the non-linear model equations. The principle of 
maximum likelihood is basically the same, although the 
Cox regression model uses partial likelihood estimation 
(considering probabilities for the event occurrences rather 
than for censored cases). The maximum likelihood estimation 
technique estimates the regression coefficients 
such that the likelihood of observing the data is maximized. 
This technique has been discussed in many statistical 
books (Eliason, 1993; Hosmer and Lemeshow, 1989; 
Kleinbaum, 1996; and Pampel, 2000). First, the probability 
distribution (i.e., model equation) of students obtaining the 
outcome is specified. Second, the joint probability 
distribution is derived as the product of the individual 
probability distributions under the assumption of 
independence. Third, a logarithmic transformation is applied 
to the joint probability distribution to yield the log 
likelihood function. Lastly, an iterative procedure is 
undertaken to estimate the regression coefficients. The 
iterative procedure begins with initial values for the 
individual parameters in the log likelihood function, 
followed by cycles of adjusting those values to improve the 
model fit. This procedure ultimately yields the 
maximum likelihood estimates of the regression coefficients. 
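
The four steps described above can be sketched for a logistic model with a single explanatory variable. This is an illustration under stated assumptions, not the algorithm SPSS implements: the data are made up, the function names are hypothetical, and the Newton-Raphson update is used as one common choice of iterative procedure.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson maximum likelihood for P = e^Z / (1 + e^Z), Z = b0 + b1*x."""
    b0, b1 = 0.0, 0.0  # initial values that start the iteration
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log likelihood with respect to b0 and b1.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative second derivatives).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def log_likelihood(b0, b1, x, y):
    """Log of the joint probability of the observed outcomes (independence assumed)."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

# Made-up data: x is a hypothetical predictor, y is the binary outcome.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 1, 0, 1]

b0, b1 = fit_logistic(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"-2LL at start  = {-2 * log_likelihood(0.0, 0.0, x, y):.3f}")
print(f"-2LL at finish = {-2 * log_likelihood(b0, b1, x, y):.3f}")
```

Each cycle adjusts the current values in the direction that raises the log likelihood, so the -2 log likelihood at the final estimates is lower than at the starting values.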

Logistic and Cox regression analyses use the same 
procedures such as enter, forward, and backward elimination 
to select significant explanatory variables (SPSS Inc, 2002). 
These procedures are briefly described as follows: (a) 
Enter procedure — All explanatory variables are forced to 
be included in the model in one step; (b) Forward stepwise 
procedure — Explanatory variables are included in the model 
one at a time based on the highest Wald or the likelihood- 
ratio statistics with the entry criterion (p=.05). As each 
new variable is added to the model, all of the existing 
explanatory variables in the model are evaluated for removal 
based on the Wald or likelihood-ratio test with the removal 
criterion (p=.10). When no more variables meet the entry or 
the removal criteria, the algorithm of selecting significant 
variables stops; and (c) Backward procedure — All of the 
explanatory variables are entered into the model during the 
first step. The explanatory variables that meet removal 
criterion (p =.10) are removed sequentially. When no more 
variables meet the removal criteria, the algorithm of selecting 
significant variables stops. 

In both models, investigators utilize the -2 log likelihood 
value as a criterion to judge the significance of the 
explanatory variables. The -2 log likelihood (-2LL), a 
chi-square statistic, is important because it reflects the 
likelihood of obtaining the observed outcomes given 
the established parameter estimates. It is also a measure 
of how well the estimated parameters fit the data; a small 
value of the -2 log likelihood means the model fits the data 
well (Norusis, 1985). Both models allow investigators to 
perform the likelihood ratio test, which compares the 
likelihood for the intercept-only model to the likelihood for 
the model containing the explanatory variables within each 
analysis. If the p value is less than the predetermined 
significance level ($\alpha$ = .05, .01, or .001), investigators 
may claim that at least one of the explanatory variables or 
risk factors in the model contributes significantly to the 
outcome variable. An additional similarity of the two models 
is the use of the Wald statistic as a criterion to test the 
association between an individual explanatory variable and 
the outcome variable. If the p value for the Wald test is less 
than the predetermined significance level, investigators may 
conclude that the specific explanatory variable significantly 
contributes to the probability or the hazard function of 
experiencing a critical event. 

Logistic and Cox regression models allow investigators 
to interpret the effect (odds ratio and hazard ratio) of 
specific explanatory variables on the outcome variable. 
The interpretations of the odds ratio and hazard ratio are 
the same, although the calculations are quite different. 
Logistic regression analysis uses the odds ratio ($e^{\beta}$) to 
indicate that a one-unit change in the explanatory variable 
changes the odds of a student obtaining a binary outcome 
by a factor of $e^{\beta}$. Cox regression analysis uses the 
hazard ratio ($e^{\beta}$) to indicate that a one-unit change in 
the risk factor changes the hazard of a student 
experiencing the critical event by a factor of $e^{\beta}$. 

Table 5 

Summary of Similarities between 
Logistic and Cox Regression Models 

Similarities of Two Models 
    Logistic and Cox Regression Analyses 

Principle of Modeling Strategies 
    Inclusion of all relevant explanatory variables or risk 
    factors at the initial stage of the model fitting; and 
    achieving parsimony and consistency upon the 
    completion stage of the model fitting 

Objective of Regression Models 
    Logistic Regression: 
    To identify explanatory variables ($X$) that significantly 
    contribute to the probability, $P(X)$, of a student 
    obtaining a binary outcome 
    Cox Regression: 
    To identify risk factors ($X$) that significantly 
    contribute to the hazard function $h(t, X)$ for the 
    duration and timeline of a student experiencing the 
    critical event 

Method of Parameter Estimations 
    The principle of the maximum likelihood estimation 
    is applied to estimate the regression coefficients 

Procedure of Variable Selections 
    Enter, forward, or backward procedures are used to 
    select significant explanatory variables or risk 
    factors to form the regression model 

Test of Significant Predictors 
    Similar to the F test in linear regression, both 
    models use the minus two log likelihood (-2LL) test for 
    the significance of the model fitting (no explanatory 
    variable contributes to the outcome occurrence vs. 
    at least one of the explanatory variables contributes 
    to the outcome occurrence). 
    Similar to the t test in linear regression, both 
    models use the Wald test for the significance of the 
    individual regression coefficients. 

Interpretation of Magnitude Effects 
    Logistic Regression: 
    To provide insight into the measure (odds, odds 
    ratio, or $e^{\beta}$) of the effects of the 
    explanatory variables 
    Cox Regression: 
    To provide insight into the measure (relative hazard, 
    hazard ratio, or $e^{\beta}$) of the effects of the 
    risk factors 

IR Applications, Number 5, Analyzing Student Learning. . . 

Differences between Logistic and 
Cox Regression Models 

As described in Table 6, the distinct characteristics of 
logistic and Cox regression models can be distinguished 
by the formation of research questions, the expression of 
model equations, the assessment of model fittings, and 
the verification of model assumptions. 

Logistic regression analysis focuses on a critical event 
occurring or not occurring. Examples of research questions 
formed include “Will the student experience the occurrence 
of the critical event, ‘yes’ or ‘no’?” and “What explanatory 
variables contribute to the occurrence of the critical event, 
‘success’ or ‘failure’?” However, Cox regression analysis 
is concerned with the duration and timing of the occurrence 
of the event. The research questions may be framed as 
“At what time (month, semester, year) does the student 
experience the critical event at the highest risk?” and 
“What risk factors contribute to the occurrence of the 
critical event at different times during the study period?” 

A key difference between logistic and Cox regression 
analyses is the use of distinct model equations to study the 
relationship between the binary outcome and the 
explanatory variables. The logistic regression model is 
written as $P(X) = e^{Z} / (1 + e^{Z})$, where 
$Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$, 
and $P(X)$ is the probability of obtaining a binary 
outcome. The Cox regression model may be expressed 
as $h(t, X) = h_0(t)\, e^{Z}$, where 
$Z = \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$. The 
hazard function, $h(t, X)$, is a function of time ($t$), the risk 
factors ($X$), and the baseline hazard $h_0(t)$. The baseline 
hazard depends on the time variable; it takes the place of the 
constant term and contributes to the hazard function in 
a multiplicative manner. 
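
The two model equations can be evaluated directly once coefficients are available. The sketch below uses made-up coefficients, risk-factor values, and baseline hazard values (none come from the study), and the function names are hypothetical. It also shows the multiplicative role of the baseline hazard: the ratio of two students' hazards is the same whatever value the baseline hazard takes.

```python
import math

def linear_predictor(betas, xs, intercept=0.0):
    """Z = intercept + b1*X1 + ... + bp*Xp."""
    return intercept + sum(b * x for b, x in zip(betas, xs))

def logistic_probability(intercept, betas, xs):
    """P(X) = e^Z / (1 + e^Z), with an intercept in Z."""
    z = linear_predictor(betas, xs, intercept)
    return math.exp(z) / (1.0 + math.exp(z))

def cox_hazard(baseline_hazard_at_t, betas, xs):
    """h(t, X) = h0(t) * e^Z, with no intercept in Z."""
    return baseline_hazard_at_t * math.exp(linear_predictor(betas, xs))

# Hypothetical coefficients and two hypothetical students.
betas = [0.8, -0.5]
student_a = [1.0, 2.0]
student_b = [0.0, 2.0]

probability_a = logistic_probability(-1.0, betas, student_a)
print(f"logistic probability for student A: {probability_a:.3f}")

# Hazard ratio of A to B at two different (made-up) baseline hazard values.
ratio_early = cox_hazard(0.01, betas, student_a) / cox_hazard(0.01, betas, student_b)
ratio_late = cox_hazard(0.20, betas, student_a) / cox_hazard(0.20, betas, student_b)
print(f"hazard ratio early: {ratio_early:.3f}, late: {ratio_late:.3f}")
```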

A noteworthy difference between the two models is 
the pattern of non-linear curves. In the logistic regression 
model, the main focus is to estimate the probability of 
obtaining the binary outcome. The estimated probability 
is a continuous measure that begins with zero and 
increases as a smooth S-shaped curve. The value of the 
probability is between zero and one depending on the 
explanatory variables and the regression coefficients. 
However, for the Cox regression model, the hazard function 
is a continuous variable dependent on the duration and 
timeline of experiencing the critical event. The hazard 
function is a rate that ranges from zero to positive 
infinity. It begins with any positive value and goes up and 
down depending on the product of the baseline hazard 
and the function of the risk factors. 

Logistic regression, unlike the Cox regression model, 
has the capability of assessing the predictive power of the 
model and performing future predictions. In the logistic 
regression model, the prediction results can be used as 
criteria to judge the accuracy of the classifications. 
Cross-tabulation is used to 
categorize the predicted and the actual responses into a 
2 by 2 table that indicates the accuracy of the classification 
results. However, in the Cox regression model, the 
classification results are not readily available to assess 
the model accuracy. Although the logistic regression 
model can be used to perform future predictions based on 
the known explanatory variables for prospective students, 
the Cox regression model is not capable of performing 
such predictions. On the right side of the Cox regression 
equation, all risk factors ($X$) for prospective students are 
observed values that are ready to be placed into the 
equation. However, the time variable ($t$) in the baseline 
hazard $h_0(t)$ for prospective students is unknown; therefore, 
it cannot be placed into the equation to perform future 
predictions. 
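
The 2 by 2 classification table described above can be built as follows. This is a sketch with made-up outcomes and predicted probabilities; the function names and the 0.5 cutoff are assumptions introduced for illustration.

```python
def classification_table(actual, predicted_probability, cutoff=0.5):
    """Count cases by (actual outcome, predicted outcome)."""
    table = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, prob in zip(actual, predicted_probability):
        predicted = 1 if prob >= cutoff else 0
        table[(a, predicted)] += 1
    return table

def overall_accuracy(table):
    """Share of cases whose predicted outcome matches the actual one."""
    correct = table[(0, 0)] + table[(1, 1)]
    return correct / sum(table.values())

# Made-up actual outcomes and model-predicted probabilities.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted_probability = [0.91, 0.62, 0.45, 0.30, 0.55, 0.80, 0.12, 0.71]

table = classification_table(actual, predicted_probability)
print("actual 0:", table[(0, 0)], "predicted 0,", table[(0, 1)], "predicted 1")
print("actual 1:", table[(1, 0)], "predicted 0,", table[(1, 1)], "predicted 1")
print("overall accuracy:", overall_accuracy(table))
```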

The availability of pseudo R-squares also differs between 
the two models. The pseudo R-square measures the 
success of the model in explaining the variations in the 
data; that is, it indicates the proportion of variation in the 
outcome variable that is accounted for by the 
explanatory variables. In the logistic regression model, the 
pseudo R-squares are used as criteria to assess the 
model goodness of fit. However, in the Cox regression 
model, the pseudo R-squares are not readily available to 
assess the model fitting in SPSS PC Version 12.0 
commands. 

Table 6 

Summary of Differences between Logistic and 
Cox Regression Models 

Differences of Two Models 
    Logistic Regression Analysis 
    Cox Regression Analysis 

Formulation of Research Questions 
    Logistic Regression Analysis: 
    Will the critical event occur (yes or no, the binary 
    outcome)? What explanatory variables contribute to the 
    occurrence of the critical event? 
    Cox Regression Analysis: 
    When (at what time) will the critical event occur (yes or 
    no, the binary outcome)? What risk factors contribute to 
    the timeline of the occurrence of the critical event? 

Expression of Model Equations 
    Logistic Regression Analysis: 
    $P(X) = e^{Z} / (1 + e^{Z})$, where $P(X)$ is the 
    probability of event occurrence; $e$ is the base of the 
    natural logarithm; 
    $Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$; 
    the $\beta$s are regression coefficients; and the $X$s are 
    explanatory variables 
    Cox Regression Analysis: 
    $h(t, X) = h_0(t)\, e^{Z}$, where $h(t, X)$ is the hazard 
    function of the event occurrence given the time variable 
    ($t$); $h_0(t)$ is the baseline hazard; $e$ is the base of the 
    natural logarithm; 
    $Z = \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$; 
    the $\beta$s are regression coefficients; and the $X$s are 
    risk factors 

Pattern of Non-Linear Curves 
    Logistic Regression Analysis: 
    $P(X)$ is a continuous measure (probability of a student 
    experiencing the critical event such as pass/fail, 
    achiever/nonachiever). 
    $P(X)$ begins with zero and increases as a smooth 
    S-shaped curve. $P(X)$ is the probability value between 
    zero and one depending on the explanatory variables and 
    the regression coefficients. 
    $P(X)$ cannot be used to detect the timeline of the 
    occurrence of the critical events. 
    Cox Regression Analysis: 
    $h(t, X)$ is a continuous measure (hazard function for the 
    timeline of a student experiencing the critical event such 
    as dropout, dismissal, and withdrawal). 
    The hazard function $h(t, X)$ begins with any positive 
    value and goes up and down. It is a rate between zero and 
    positive infinity depending on the product of the baseline 
    hazard and the function of the risk factors. 
    $h(t, X)$ can be used to detect the timeline of the 
    occurrence of the critical events. 

Assessment of Accurate Predictions 
    Logistic Regression Analysis: 
    Explanatory variables are observed values that are ready 
    to be plugged into the equation to perform predictions. 
    The prediction results allow investigators to identify 
    potential at-risk students to participate in the mandatory 
    intervention program. 
    The classification results are available to assess 
    prediction accuracy. 
    Cox Regression Analysis: 
    Risk factors are observed values that are readily available 
    for the predictions. However, the time variable ($t$) in the 
    baseline hazard function $h_0(t)$ is an unknown value that 
    cannot be plugged into the equation to perform 
    predictions. 
    The classification results are not available to assess 
    prediction accuracy. 

Assessment of Model Fittings 
    Logistic Regression Analysis: 
    Pseudo R-square is readily available to assess the model 
    fitting. 
    Cox Regression Analysis: 
    Pseudo R-square is not available to assess the model 
    fitting. 

Verification of Model Assumptions 
    Logistic Regression Analysis: 
    Residuals are normally distributed with a mean of zero and 
    a constant variance. The histogram and scattergram of 
    residuals are plotted to check for the normality and the 
    homogeneity of variance. 
    Cox Regression Analysis: 
    Hazard ratios are constant across time. If the hazards are 
    proportional, the survival curves generated by log minus 
    log (LML) should be parallel. 

Finally, another key difference between the two models 
lies in the model assumptions. For the logistic regression 
model, residuals are assumed to have a mean of zero and 
a variance of $P(X)[1 - P(X)]$. Investigators must check for 
violation of the assumption by plotting the histogram and 
scatter diagram of the residuals. The model assumption is 
satisfied if (1) the histogram of the residuals is normally 
distributed with a mean of zero and (2) the residuals on the 
scatter diagram appear to be parallel with the X-axis (i.e., 
an indication of constant variance). The Cox regression 
model states that there is a multiplicative relationship 
between the baseline hazard function $h_0(t)$ and the function 
of the risk factors. As a result, the ratio of the hazard 
functions for students with different values of the risk 
factors does not depend on time ($t$). Therefore, investigators 
need to check for violation of this assumption by using the 
log-minus-log (LML) plot of the survival function. If the 
hazards are proportional, the curves generated by 
LML should be parallel (Steinberg, 1999), separated only 
by a constant vertical difference. 
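
A short derivation shows why parallel LML curves correspond to proportional hazards. It is a sketch that assumes the standard relationship between the survival function and the cumulative hazard, which the text does not state explicitly, and it assumes risk factors that do not change over time. Here $S_0(t)$ denotes the baseline survival function, and $Z_A$ and $Z_B$ are the linear predictors of two hypothetical groups.

```latex
\begin{align*}
S(t, X) &= \exp\!\Big(-\int_0^t h(u, X)\,du\Big)
         = \exp\!\Big(-e^{Z}\int_0^t h_0(u)\,du\Big)
         = \big[S_0(t)\big]^{e^{Z}}, \\
\ln\!\big[-\ln S(t, X)\big] &= \ln\!\big[-\ln S_0(t)\big] + Z, \\
\ln\!\big[-\ln S(t, X_A)\big] - \ln\!\big[-\ln S(t, X_B)\big] &= Z_A - Z_B .
\end{align*}
```

The last line does not involve $t$, so the two LML curves differ by the same vertical distance at every time point.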

Summary 

In this study, the main objectives for constructing logistic 
and Cox regression models were accomplished. For logistic 
regression analysis, the explanatory variables contributing 
to the probability of a student passing the USMLE Step 1 
were identified. It was evident that the MCAT (physical 
and biological sciences) scores, number of sophomore 
courses failed, and medical school freshman GPAs were 
significantly associated with the USMLE Step 1 
performances. The study results confirmed that MCAT 
scores and medical school course performances were 
significant predictors of the USMLE Step 1 (Chen et al., 
2001; and Haught and Walls, 2002). The implication of the 
study results was that the medical school should continue 
its effort to recruit and admit qualifying students with high 
MCAT scores, and to strengthen teaching and learning to 
ensure student success on the licensure examination. 

With regard to Cox regression analysis, the method 
indicated that academic difficulty was significantly 
accounted for by risk factors such as MCAT verbal 
reasoning score, gender, and number of sophomore courses 
failed. Moreover, for students in the five-year curriculum 
track, the hazard of academic difficulty began to rise during 
the first semester of their sophomore year, peaked at the 
second semester of the sophomore year, and remained at the 
same level through the rest of the study period. The research 
results were consistent with the literature stating that an 
increase in the relative risk for a student experiencing 
academic difficulty was significantly associated with a low 
MCAT score (Huff and Fang, 1999), and that students at risk 
for academic difficulty remained at risk throughout the first 
three years of medical school (Fang, 2000). The implication 
of this study was that the medical school should address 
academic difficulty issues through academic development 
and support services. 

Strengths 

Both logistic and Cox regression models are typically 
used for data analysis concerning binary outcomes such 
as admitted/not admitted, enrolled/not enrolled, and 
graduated/not graduated. In particular, the logistic 
regression method allows investigators to 
answer some important questions linked to learning 
outcomes: “Is student performance likely to improve or not 
improve after the implementation of tutorial and remedial 
programs?” “Are student graduation and attrition status 
significantly associated with the explanatory variables 
concerning student characteristics and college learning 
environment?” and “Can the probability of students 
expressing overall college satisfaction be estimated by 
certain explanatory variables concerning academic 
programs and services?” 

In Cox regression analysis, the hazard function is a 
function of the time-to-event and risk factors. This function 
provides investigators with valuable information to 
answer certain learning outcome questions such as “How 
many semesters elapse before students experience 
academic difficulty?” “What risk factors are significantly 
associated with the occurrence of academic difficulty?” 
“At what month are students likely to pass the licensure 
examination?” “What explanatory variables significantly 
contribute to the success of licensure examination?” “How 
many years pass before students graduate from the 
college?” and “To what extent are students’ timely and delayed 
graduation accounted for by the explanatory variables 
concerning the overall quality of educational program and 
student support?” 

Clearly, these two models are proven to be useful tools 
in studying explanatory variables that are significantly 
associated with binary outcomes. Moreover, the effect of 
a specific explanatory variable or risk factor on the event 
occurrence can be investigated holding other explanatory 
variables or risk factors constant. Both methods also 
allow investigators to assess the model fittings by means 
of the likelihood ratio test. Furthermore, using the logistic 
regression model, investigators can perform classifications, 
and subsequently evaluate the predictive power. 
Limitations 

This study is not completely conclusive because of 
some methodological limitations. For example, given that both 
models were built to study the effects of the explanatory 
variables on binary outcomes (passing the licensure 
examination, yes or no, or experiencing academic difficulty, 
yes or no), the investigator is unable to use the same 
techniques to study multiple outcome categories. These 
outcome categories include three pass or fail groups for the 
licensure examination (e.g., first-time pass, second-time 
pass, and fail at least two times) and three categories of 
academic difficulty (e.g., dismissal, withdrawal, and leave 
of absence), respectively. 

The Cox regression model is also known as the Cox 
proportional hazards model and has to satisfy the model 
assumption of proportional hazards. If the model 
assumption is seriously violated, the analysis results 
could be inaccurate and misleading. However, it is difficult 
to make judgments about the proportional hazards on the 
LML plot of survival curves partly because the functions 
are non-linear instead of straight lines. In other words, the 
validity of the model assumption may not be properly 
evaluated because of the limitations of the LML 
assessment tool. 

Major Alternatives 

To study the outcome variable with multiple categories 
as a function of the explanatory variables, polytomous 
logistic regression analysis is a possible alternative (Peng, 
et al., 2002). Polytomous logistic regression analysis 
includes ordered and multinomial logistic regression 
models. The ordered logistic regression model is applicable 
to three or more ordinal outcome categories (e.g., first- 
time pass, second-time pass, and at least two-time fail 
groups for licensure examination). The model is called the 
cumulative logit model because it is based on the 
cumulative response probabilities of being in a category or 
lower (Walters, et al., 2001). The model is also called the 
proportional odds model because it assumes that the 
corresponding regression coefficients in the link function 
are equal for each explanatory variable. Therefore, the 
model assumption has to be verified carefully by the 
parallel lines test (SPSS, Inc., 2002). If the model 
assumption is satisfied, investigators can proceed to 
interpret the effects of the explanatory variables. 
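
The cumulative logit model described above can be written out as follows. The notation is an assumption introduced here ($Y$ for the ordinal outcome with $J$ categories, $\alpha_j$ for the category-specific constants), and sign conventions differ across software, with some packages subtracting the linear predictor rather than adding it.

```latex
\begin{align*}
\ln\frac{P(Y \le j \mid X)}{P(Y > j \mid X)}
  &= \alpha_j + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p,
  \qquad j = 1, \dots, J - 1 .
\end{align*}
```

With three outcome categories there are two such equations. They have different constants $\alpha_1$ and $\alpha_2$ but share the same coefficients $\beta_1, \dots, \beta_p$, which is the proportional odds assumption that the parallel lines test examines.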

If the model assumption is violated in the ordered 
logistic regression analysis, the multinomial logistic 
regression model should be considered as another 
alternative. The outcome variable of multiple nominal groups 
includes the reference group and the target groups. For 
instance, in passing licensure examination, the first-time 
pass group is labeled as the reference group, the second-time 
pass group is coded as target group 1, and the group failing 
at least two times is considered target group 2. Two 
model equations are generated for the nominal outcome 
with the three groups. In addition, the two sets of relative 
risk rates are calculated when the probability of a student 
falling into a specific target group is compared to that of 
a student in the reference group (Walters, Campbell, and 
Lall, 2001). The multinomial logistic regression analysis 
relaxes the model assumption of the proportional odds. It 
does not require that investigators verify the assumption of 
parallel lines because the relationship between the 
explanatory variables and the effects of these variables 
depends on the outcome category (Plank and Jordan, 
1997). 

An implicit feature of the Cox regression model is 
reflected in the model assumption of proportional hazards 
for all time intervals. If the LML plot of survival curves for 
assessing the validity of the model assumption is not 
successful, the smoothed plots of the scaled Schoenfeld 
residuals proposed by Therneau and Grambsch (Belle, et 
al., 2004) may be a better approach. The Schoenfeld 
residual method not only provides the investigator with an 
easier visual interpretation, it also offers a statistical test 
for the proportional hazards assumption (Belle, et al., 
2004). An additional alternative for detecting the violation 
of the model assumption is the likelihood ratio test (Palmer, 
et al., 2003). Departure from the assumption of proportional 
hazards can be analyzed by the likelihood ratio test 
comparing models with and without the stratifying variable 
by the covariate interaction terms (e.g., the curriculum 
track by the three explanatory variables — MCAT verbal 
reasoning score, gender, and the number of sophomore 
courses failed, respectively, in the present study). If no 
statistical significance is found in these interaction terms, 
then there is no evidence that the model assumption of 
proportional hazards is violated. 

In the Cox regression model, if the effects of the 
explanatory variables change over time in a practical 
situation (e.g., age and financial aid amount fluctuate over 
the years), the model assumption may be violated. 
For this reason, the extended Cox regression model, which 
utilizes time-dependent variables, should be considered 
as an alternative for analyzing data that do not satisfy the 
model assumption (Klein and Moeschberger, 1997; and 
Kleinbaum, 1996). The extended Cox regression model 
allows investigators to study the effect of the explanatory 
variables along with the time-dependent variables on the 
hazard function. 

Editor’s Notes 

Recently there has been an increased interest in using 
various methodologies that better fit the situations we 
encounter. While Multiple Linear Regression with its 
Ordinary Least Squares has been part of our methodology 
for many years, it has always had certain limitations. For 
example in predicting probabilities it has the unfortunate 
characteristic of going both negative and also going greater 
than one. Logistic Regression deals with this by creating 
a dependent variable, the log odds, that can range from 
negative to positive and that can exceed 1.0. While 
Logistic Regression has been discussed in the statistical 
literature for numerous years, it has become more prevalent 
in institutional research during the last several years. 

A second use of Linear Regression has been to explain 
the time between a process starting and completing. Here 
again the linearity of relationships tends to be suspect. In 
addition the regression model is not able to handle data 
where specific completion date is not available. Cox 
Regression is capable of dealing with both of these issues. 
As with logistic regression, it uses a function of the 
exponential and converts the relationship into one that is 
linear in logarithms. 

While both the Logistic and Cox procedures are part of 
many standard statistical software packages, such as SPSS, 
correctly using them is frequently not intuitive. This is 
where the article by Dr. Chen provides an extremely 
valuable service. First it provides a good summary of 
when and why a researcher would want to use these types 
of models. In addition it provides an example of their use. 
In fact it gives you the step-by-step procedures needed to 
run your own analysis in SPSS. Next it gives a brief 
interpretation of the results. 

What I think you will find most exciting, however, is Table 
6, where there is a head-to-head comparison of the two 
methodologies. As an exercise to consider when linear 
regression should be used, the reader may well want to 
add an additional column to Table 6, title the column 
Linear Regression, and fill in the cells. For example under 
Pattern of Non-Linear Curves, one might want to list the 
use of various nonlinear transformations such as quadratic 
and cubic terms. The reader might also want to add rows 
to Table 6 such as Interpretation of Regression Weights. 
The interpretation of these weights for the Logistic and 
Cox procedures can be found in the text of the article. 

This suggestion also applies to Table 5 where you may 
want to put in the similarities that Multiple Linear Regression 
has with Logistic and Cox Regression. 

Also found in this article is the discussion of the 
relationships between Hazard and Survival functions. As 
such this paper represents an excellent first step or primer 
and explains techniques that greatly extend our analytical 
methodologies. Readers do need to be warned, however, 
that if they are interested in using these techniques they 
must focus on learning more about them. For example 
when using Chi Square and testing within stepwise or 
nested models, it is extremely important to select 
appropriate models that are theoretically relevant. The 
selection of the specific stratification of the model, where 
stratification is desirable, is also an extremely important 
element in the methodology. In terms of building the 
models, Dr. Chen selected Forward-Wald to build his 
survival model. There are other options, such as 
simultaneously entering the variables or backward 
elimination of variables. The decision to use a specific 
strategy should be based on the situation and 
the intent of the researcher. 

Another option the researcher has in Logistic Regression 
is to adjust the cut-point for classifying an observation 
into the two categories. Dr. Chen used the default of .5, but 
you may want to use a proportion that is closer to the 
actual split in the sample. 
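
The effect of moving the cut-point can be sketched in Python. The observed outcomes and predicted probabilities below are invented for illustration and are not from Dr. Chen's analysis. The rule "classify as 1 when the predicted probability is at or above the cut-point" is also an assumption; a given software package may treat ties differently.

```python
def classify(probabilities, cut_point=0.5):
    """Assign 1 when the predicted probability is at or above the cut-point."""
    return [1 if p >= cut_point else 0 for p in probabilities]


def classification_table(observed, predicted):
    """Count cases for each (observed, predicted) pair of categories."""
    table = {(o, p): 0 for o in (0, 1) for p in (0, 1)}
    for o, p in zip(observed, predicted):
        table[(o, p)] += 1
    return table


# Hypothetical sample: 8 of 10 students passed (coded 1).
observed = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
predicted_probs = [0.95, 0.90, 0.88, 0.85, 0.82, 0.81, 0.78, 0.70, 0.75, 0.55]

# Proportion of 1s in the sample, used as the alternative cut-point.
sample_split = sum(observed) / len(observed)

if __name__ == "__main__":
    for cut in (0.5, sample_split):
        table = classification_table(observed, classify(predicted_probs, cut))
        print(f"cut-point {cut:.2f}: {table}")
```

In this made-up sample the default of .5 places every student in the passing group, so neither failing student is identified. Using the sample split as the cut-point identifies both failing students at the cost of misclassifying two passing students.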

This article will give you an extremely good start in 
appropriately using two alternatives to the traditional 
regression methodology. As you begin to use one or both 
of the alternatives, the references he provides contain the 
information that will be essential for you to use the 
methodologies appropriately. 